[GitHub] madlib pull request #344: Add kd-tree option to knn.

2019-01-07 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/344

Add kd-tree option to knn.

This commit adds a partial kd-tree implementation to be used for knn
operations. This function is designed to work independently in case some
future modules require its functionality.
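
For readers unfamiliar with the data structure, here is a minimal in-memory sketch of a kd-tree with a nearest-neighbor query. It is a generic textbook version and not the code in this pull request; the names `Node`, `build_kdtree`, `squared_dist` and `nearest` are made up for illustration.

```python
# Illustrative sketch only; not the MADlib implementation.
from collections import namedtuple

Node = namedtuple("Node", ["point", "axis", "left", "right"])


def build_kdtree(points, depth=0):
    """Recursively split the points on the median of one coordinate per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(ordered[mid], axis,
                build_kdtree(ordered[:mid], depth + 1),
                build_kdtree(ordered[mid + 1:], depth + 1))


def squared_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, query, best=None):
    """Return the stored point closest to query."""
    if node is None:
        return best
    if best is None or squared_dist(node.point, query) < squared_dist(best, query):
        best = node.point
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Only descend into the far side if the splitting plane is closer
    # than the best match found so far.
    if diff ** 2 < squared_dist(best, query):
        best = nearest(far, query, best)
    return best
```

Calling `nearest(build_kdtree(points), query)` returns one closest point; a k-nearest version would keep a bounded heap of candidates instead of a single `best`.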

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib feature/kd-tree

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/344.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #344


commit 05ed2e172070f0d49baf8b04aed5a3ba42c1f418
Author: Orhan Kislal 
Date:   2018-12-06T08:08:33Z

Add kd-tree option to knn.

This commit adds a partial kd-tree implementation to be used for knn
operations. This function is designed to work independently in case some
future modules require its functionality.




---


[GitHub] madlib pull request #343: Linear Regression: Support for JSON and special ch...

2019-01-03 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/343#discussion_r245023325
  
--- Diff: src/ports/postgres/modules/regress/linear.py_in ---
@@ -185,10 +221,12 @@ def _validate_args(schema_madlib, source_table, 
out_table, dependent_varname,
 if grouping_cols is not None:
 _assert(grouping_cols != '',
 "Linregr error: Invalid grouping columns name!")
+# grouping columns can be a valid expression as well, for eg.
+# a json expression (data->>'id'), so commenting this part.
 grouping_list = _string_to_array_with_quotes(grouping_cols)
-_assert(columns_exist_in_table(
-source_table, grouping_list, schema_madlib),
-"Linregr error: Grouping column does not exist!")
+#_assert(columns_exist_in_table(
--- End diff --

We should clean up these comments before the merge.


---


[GitHub] madlib pull request #343: Linear Regression: Support for JSON and special ch...

2019-01-03 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/343#discussion_r245022983
  
--- Diff: src/ports/postgres/modules/regress/linear.py_in ---
@@ -134,11 +170,11 @@ def linregr_train(schema_madlib, source_table, 
out_table,
   'linregr'::varchar  as method
 , '{source_table}'::varchar   as source_table
 , '{out_table}'::varchar  as out_table
-, '{dependent_varname}'::varchar  as 
dependent_varname
-, '{independent_varname}'::varcharas 
independent_varname
+, $${dependent_varname}$$::varchar  as 
dependent_varname
+, $${independent_varname}$$::varcharas 
independent_varname
 , {num_rows_processed}::integer   as 
num_rows_processed
 , {num_rows_skipped}::integer as 
num_missing_rows_skipped
-, {grouping_col}::textas grouping_col
+, $${grouping_col}$$::textas 
grouping_col
--- End diff --

These additional quotes around the grouping columns break the PMML tests.


---


[GitHub] madlib pull request #339: Build: Add PG11 Support

2018-11-29 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/339#discussion_r237604025
  
--- Diff: src/ports/postgres/modules/kmeans/kmeans.sql_in ---
@@ -766,15 +766,30 @@ BEGIN
 
 proc_fn_dist := fn_dist
 || '(DOUBLE PRECISION[], DOUBLE PRECISION[])';
-IF (SELECT prorettype != 'DOUBLE PRECISION'::regtype OR proisagg = TRUE
-FROM pg_proc WHERE oid = proc_fn_dist) THEN
-RAISE EXCEPTION 'Kmeans error: Distance function has wrong 
signature or is not a simple function.';
-END IF;
-proc_agg_centroid := agg_centroid || '(DOUBLE PRECISION[])';
-IF (SELECT prorettype != 'DOUBLE PRECISION[]'::regtype OR proisagg = 
FALSE
-FROM pg_proc WHERE oid = proc_agg_centroid) THEN
-RAISE EXCEPTION 'Kmeans error: Mean aggregate has wrong signature 
or is not an aggregate.';
+
+-- Handle PG11 pg_proc table changes
--- End diff --

I tried this method, but it requires casting `regprocedure` to `varchar`. 
This is allowed on PG versions after 8.3. On earlier versions, we have to use 
the `textin` function. This means we will need another if check for GPDB4.3.


---


[GitHub] madlib pull request #339: Build: Add PG11 Support

2018-11-26 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/339

Build: Add PG11 Support

JIRA: MADLIB-1283

PG11 support required a number of minor changes in the code.
- Change TRUE/FALSE to true/false
- Use TupleDescAttr function instead of direct access.
- Use prokind column instead of proisagg.

We also added a function to check if the PG version is earlier than 11
as well as the necessary cmake files.
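
As a rough illustration of the version check and the `proisagg`/`prokind` switch, here is a small Python sketch. It is not the function added in this PR; the helper names are hypothetical, and it assumes the version text has the usual `PostgreSQL <major>...` shape returned by `SELECT version()`.

```python
import re


def is_pg_major_less_than(version_string, major):
    """Parse the leading major number from text such as 'PostgreSQL 10.5 on ...'."""
    match = re.search(r"PostgreSQL\s+(\d+)", version_string)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version_string)
    return int(match.group(1)) < major


def aggregate_predicate(version_string):
    """Pick the pg_proc condition that identifies an aggregate function."""
    if is_pg_major_less_than(version_string, 11):
        # Before PG11, pg_proc has a boolean proisagg column.
        return "proisagg = true"
    # From PG11 on, prokind is 'a' for aggregates.
    return "prokind = 'a'"
```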

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib build/pg-11-support

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/339.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #339


commit b63303c5bbcebeb82ab03694e4b3dade7d1827ab
Author: Orhan Kislal 
Date:   2018-11-19T16:02:53Z

Build: Add PG11 Support

JIRA: MADLIB-1283

PG11 support required a number of minor changes in the code.
- Change TRUE/FALSE to true/false
- Use TupleDescAttr function instead of direct access.
- Use prokind column instead of proisagg.

We also added a function to check if the PG version is earlier than 11
as well as the necessary cmake files.




---


[GitHub] madlib pull request #337: Madpack: Add UDO and UDOC automation

2018-11-09 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/337#discussion_r232282360
  
--- Diff: src/madpack/diff_udo.sql ---
@@ -0,0 +1,81 @@

+--
+-- Licensed to the Apache Software Foundation (ASF) under one
+-- or more contributor license agreements.  See the NOTICE file
+-- distributed with this work for additional information
+-- regarding copyright ownership.  The ASF licenses this file
+-- to you under the Apache License, Version 2.0 (the
+-- "License"); you may not use this file except in compliance
+-- with the License.  You may obtain a copy of the License at
+
+--   http://www.apache.org/licenses/LICENSE-2.0
+
+-- Unless required by applicable law or agreed to in writing,
+-- software distributed under the License is distributed on an
+-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+-- KIND, either express or implied.  See the License for the
+-- specific language governing permissions and limitations
+-- under the License.

+--
+
+SET client_min_messages to ERROR;
+\x on
+
+CREATE OR REPLACE FUNCTION filter_schema(argstr text, schema_name text)
+RETURNS text AS $$
+if argstr is None:
+return "NULL"
+return argstr.replace(schema_name + ".", '')
+$$ LANGUAGE plpythonu;
+
+CREATE OR REPLACE FUNCTION alter_schema(argstr text, schema_name text)
+RETURNS text AS $$
+if argstr is None:
+return "NULL"
+return argstr.replace(schema_name + ".", 'schema_madlib.')
+$$ LANGUAGE plpythonu;
+
+
+CREATE OR REPLACE FUNCTION get_udos(table_name text, schema_name text,
+ type_filter text)
+RETURNS VOID AS
+$$
+import plpy
+
+plpy.execute("""
+create table {table_name} AS
+SELECT *
+FROM (
+SELECT n.nspname AS "Schema",
+   o.oprname AS name,
+   filter_schema(o.oprcode::text, '{schema_name}') AS 
oprcode,
+   alter_schema(pg_catalog.format_type(o.oprleft, 
NULL), '{schema_name}') AS oprleft,
+   alter_schema(pg_catalog.format_type(o.oprright, 
NULL), '{schema_name}') AS oprright,
+   alter_schema(pg_catalog.format_type(o.oprresult, 
NULL), '{schema_name}') AS rettype
+FROM pg_catalog.pg_operator o
+LEFT JOIN pg_catalog.pg_namespace n ON n.oid = 
o.oprnamespace
+WHERE n.nspname OPERATOR(pg_catalog.~) '^({schema_name})$'
--- End diff --

I used the `\do madlib.*` command of `psql` as a basis. The corresponding 
query (you can see it if you start `psql` with the `-E` flag) uses this 
particular phrase to get all of the operators of a particular schema. 
Basically, this regex looks at the schema name (n.nspname) and filters out 
the rows whose schema name does not match the madlib schema name exactly, 
from start (^) to end ($). 
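
To see what the anchors do, here is a small Python sketch of the same pattern. Python's `re` stands in for the PostgreSQL `~` operator, and the `re.escape` call and the sample schema names are my additions for the example.

```python
import re


def in_schema(nspname, schema_name):
    """True only if nspname is exactly schema_name."""
    # re.escape is an addition here; the quoted query interpolates the name as-is.
    pattern = "^(" + re.escape(schema_name) + ")$"
    return re.search(pattern, nspname) is not None


schemas = ["madlib", "madlib_old_vers", "my_madlib", "pg_catalog"]
matches = [s for s in schemas if in_schema(s, "madlib")]
```

Without the `^` and `$`, `madlib_old_vers` and `my_madlib` would match as well, since they contain `madlib` as a substring.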


---


[GitHub] madlib pull request #337: Madpack: Add UDO and UDOC automation

2018-11-09 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/337#discussion_r232279216
  
--- Diff: src/madpack/create_changelist.py ---
@@ -237,6 +325,13 @@
 print "Something went wrong! The changelist might be wrong/corrupted."
 raise
 finally:
-os.system("rm -f /tmp/madlib_tmp_nm.txt /tmp/madlib_tmp_udf.txt "
-  "/tmp/madlib_tmp_udt.txt /tmp/madlib_tmp_cl.yaml "
--- End diff --

Nice catch, it should still be removed.


---


[GitHub] madlib pull request #337: Madpack: Add UDO and UDOC automation

2018-10-26 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/337

Madpack: Add UDO and UDOC automation

JIRA: MADLIB-1281

- Add scripts for detecting changed/dropped UDOs and UDOCs.
- Expand the create_changelist.py file to consume these scripts and
create changelists with these fields filled if necessary.
- Fix the update_util.py to use the correct dictionary key.
- Add drop operator class command to the svac.sql_in to make sure the
old class is removed before creating the updated one.
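
As a sketch of the detection idea (not the actual scripts, which query the catalog in SQL), the comparison between the two installed versions can be pictured like this. The dictionary layout and the function name are assumptions made for the example.

```python
def diff_operators(old_ops, new_ops):
    """Compare two operator listings.

    Each argument maps (name, left_type, right_type) to
    (procedure, return_type). Returns (dropped, changed).
    """
    dropped = sorted(k for k in old_ops if k not in new_ops)
    changed = sorted(k for k in old_ops
                     if k in new_ops and old_ops[k] != new_ops[k])
    return dropped, changed
```

An operator that exists only in the old listing counts as dropped; one that exists in both but points to a different procedure or return type counts as changed.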

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib madpack/complete-changelist

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/337.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #337


commit 09c3bb2e55417309a45f0729f370920273be40b4
Author: Orhan Kislal 
Date:   2018-10-24T12:55:34Z

Madpack: Add UDO and UDOC automation

JIRA: MADLIB-1281

- Add scripts for detecting changed/dropped UDOs and UDOCs.
- Expand the create_changelist.py file to consume these scripts and
create changelists with these fields filled if necessary.
- Fix the update_util.py to use the correct dictionary key.
- Add drop operator class command to the svac.sql_in to make sure the
old class is removed before creating the updated one.




---


[GitHub] madlib pull request #333: Update version numbers to 1.16-dev

2018-10-23 Thread orhankislal
Github user orhankislal closed the pull request at:

https://github.com/apache/madlib/pull/333


---


[GitHub] madlib pull request #333: Update version numbers to 1.16-dev

2018-10-19 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/333

Update version numbers to 1.16-dev



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib release/new-version

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/333.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #333


commit 5f4fdce8bf976914d7b929817ca5fbff0f1029ec
Author: Orhan Kislal 
Date:   2018-10-19T17:23:42Z

Update version numbers to 1.16-dev




---


[GitHub] madlib issue #332: Update Dockerfile to use ubuntu 16.04

2018-10-19 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/332
  
Please do not merge this PR until we change the version to 1.16-dev. 


---


[GitHub] madlib issue #331: Build: Include preflight and postflight scripts for mac

2018-10-09 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/331
  
Good catch +1


---


[GitHub] madlib issue #329: Release/prep 1.15.1

2018-10-04 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/329
  
Thanks for the comments. @fmcquillan99, regarding MADLIB-1171: the following 
commit about AO tables references this JIRA even though they are not related: 
https://github.com/madlib/madlib/commit/3db98babe3326fb5e2cd16d0639a2bef264f4b04.
 It is very strange because the JIRA activity does not show that commit, but it 
has no trouble catching the mention in your comment. 


---


[GitHub] madlib pull request #325: Madpack/ic func schema

2018-10-04 Thread orhankislal
Github user orhankislal closed the pull request at:

https://github.com/apache/madlib/pull/325


---


[GitHub] madlib pull request #330: Margins: Copy summary table instead of renaming

2018-10-03 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/330

Margins: Copy summary table instead of renaming

JIRA: MADLIB-1274

Margins summary table gets dropped since its schema remains pg_temp.
This commit fixed the issue by copying the contents instead of renaming.
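
The difference between the two approaches can be sketched as the SQL each one would issue. This is a simplified illustration with hypothetical helper names, not the code in this commit; a renamed table stays in its original schema, which is why a table created in pg_temp still disappears with the session.

```python
def rename_statements(temp_table, out_table):
    """Old approach: RENAME keeps the table in its original schema (pg_temp)."""
    return ["ALTER TABLE {0} RENAME TO {1}".format(temp_table, out_table)]


def copy_statements(temp_table, out_table):
    """New approach: create the output table from the contents, then drop the source."""
    return [
        "CREATE TABLE {0} AS SELECT * FROM {1}".format(out_table, temp_table),
        "DROP TABLE {0}".format(temp_table),
    ]
```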

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib bugfix/margins-summary

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/330.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #330


commit 67bf28d9c196b969a925837eab0edda0de814193
Author: Orhan Kislal 
Date:   2018-09-24T14:34:33Z

Margins: Copy summary table instead of renaming

JIRA: MADLIB-1274

Margins summary table gets dropped since its schema remains pg_temp.
This commit fixed the issue by copying the contents instead of renaming.




---


[GitHub] madlib pull request #329: Release/prep 1.15.1

2018-10-02 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/329

Release/prep 1.15.1



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib release/prep-1.15.1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/329.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #329


commit d12a18bea42e914e0f7e4d550317537ce58daca3
Author: Orhan Kislal 
Date:   2018-09-28T07:21:40Z

Build: Change version to 1.15.1

commit 8fb4f162a409e0ecdbd4b80b8ce3ff1bd050b90c
Author: Orhan Kislal 
Date:   2018-09-28T10:30:30Z

Update RELEASE_NOTES

commit 6a8a3395761cae401b5b4b5bfc36259cc14db648
Author: Orhan Kislal 
Date:   2018-09-28T13:33:37Z

Add 1.15.1 changelist and fix upgrade util.

Upgrade was failing when functions without any arguments were added to
the changelist. This commit fixes the issue by setting the argument list
to empty string.
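
A minimal sketch of the kind of fix described, assuming a changelist entry is a dictionary whose `argument` key can be missing or empty for a function without arguments. The key name, the helper name and the generated statement are assumptions for illustration, not the actual upgrade code.

```python
def drop_function_statement(schema, name, entry):
    """Build a DROP statement from a changelist-style entry."""
    # A function without arguments may carry None here; fall back to
    # an empty string so the statement ends in "()" instead of "(None)".
    args = entry.get("argument") or ""
    return "DROP FUNCTION IF EXISTS {0}.{1}({2})".format(schema, name, args)
```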




---


[GitHub] madlib issue #325: Madpack/ic func schema

2018-10-02 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/325
  
Another alternative is to integrate the MADlib deployment into the IC/DC. 
What I mean is similar to how PostGIS runs its unit tests. IC/DC creates a 
temporary database/schema, deploys MADlib there, runs the tests as 
usual and then removes the temporary database/schema. This will inevitably 
increase the running time of IC/DC but I believe it will be more stable. Since 
we assume that the user might be using the MADlib deployment schema, it is also 
possible that they drop and/or recreate UDTs, UDFs and UDAs. Our IC/DC does not 
account for a case like that and will probably fail. 

It will also mean that a user can run IC/DC before they deploy MADlib to their 
target database. I would assume most users are already using a similar workflow 
(temp database -> deploy MADlib -> run IC -> deploy MADlib on actual target).
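
The control flow I have in mind could be sketched roughly as follows. The four callables are placeholders for whatever madpack would actually do at each step; nothing here is existing madpack code.

```python
def run_install_check(create_db, deploy, run_tests, drop_db):
    """Run the tests against a throwaway database and always clean it up."""
    create_db()
    try:
        deploy()
        return run_tests()
    finally:
        # The temporary database is removed even if deploy or the tests fail.
        drop_db()
```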


---


[GitHub] madlib issue #325: Madpack/ic func schema

2018-09-30 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/325
  
> Also note: madpack is supposed to drop the file-specific schema after 
executing each file (see function _execute_per_module_install_dev_check_algo). 
Hence, common table names in independent tests within same module are not 
supposed to conflict with each other (if you've seen this happen then it 
requires investigation).

Oh, I see. The naming convention is somewhat strange. Under the modules 
folder, we have a bunch of folders (graph, etc.) but they are not actually 
modules; the individual sql files are. The madpack code uses the variable 
`module` for the folder name, which further muddles the naming. This means the 
`madlib_installcheck_graph` schema will be created and dropped for each module 
in graph. We might want to change it to reflect the actual module name. 

I think the reused name issue might be more widespread than we think. I am 
pretty sure `abalone` and `houses` datasets are used in multiple modules.

I think removing the `DROP TABLE` statements might work as @iyerr3 
suggested. I'll keep the PR open for now to keep the conversation open and 
start working on a different branch.


---


[GitHub] madlib issue #325: Madpack/ic func schema

2018-09-28 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/325
  
I agree this is not a great solution. Casting the operators makes it 
especially awkward to use. However, we have to consider the following case. If 
a module has multiple test files like `graph` and if they re-use the same table 
names like `vertex`, then we have to drop them before re-creating. 


---


[GitHub] madlib issue #325: Madpack/ic func schema

2018-09-27 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/325
  
@iyerr3 @jingyimei I'll put the following in the commit message before 
merging.
This commit fixes the following potential issue.  
1. User deploys MADlib on the schema `madlib1`.
2. User creates a table named `vertex` in the `madlib1` schema.
3. User runs install-check.
4. The install check creates a new role and a new schema for each module in 
the database.
5. The install check sets the `search_path` to 
`madlib_installcheck_, madlib1`.
6. The graph IC calls `DROP TABLE IF EXISTS vertex` and fails because the 
vertex table does exist but it is not owned by the install-check role.

This commit removes the madlib installation schema from the search path so 
that it only uses its own schema. This means every madlib function call, type 
and operator has to be called directly using the madlib schema name. 

One alternative solution is eliminating the `drop table` commands from the 
tests, but that would require a very complicated refactoring since most of 
the tests are written to reuse the same output table names. Another alternative 
is changing the `drop table` and `create table` commands to use the newly 
created test schema. However, this is very tricky to test; if a developer 
forgets to put the schema name, the test will still work unless she also 
creates a table of the same name in the madlib deployment schema.
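
The failure in steps 1-6 can be reproduced with a toy model of name resolution. This is only a simulation in Python with made-up role names; it mimics how an unqualified table name is looked up along the `search_path` and is not how madpack or PostgreSQL is implemented.

```python
def resolve(table, search_path, catalog):
    """Return the first schema on search_path that contains table, else None."""
    for schema in search_path:
        if table in catalog.get(schema, {}):
            return schema
    return None


def drop_if_exists(table, search_path, catalog, role):
    """Mimic DROP TABLE IF EXISTS for an unqualified table name."""
    schema = resolve(table, search_path, catalog)
    if schema is None:
        return "skipped"  # IF EXISTS: nothing to drop
    if catalog[schema][table] != role:
        raise PermissionError("must be owner of table " + table)
    del catalog[schema][table]
    return "dropped " + schema + "." + table


# catalog maps schema -> {table name: owner role}
catalog = {"madlib_installcheck_graph": {}, "madlib1": {"vertex": "some_user"}}
```

With `madlib1` on the search path, `drop_if_exists("vertex", ["madlib_installcheck_graph", "madlib1"], catalog, "ic_role")` raises the ownership error; with only the test schema on the path, the same call is simply skipped.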


---


[GitHub] madlib pull request #325: Madpack/ic func schema

2018-09-27 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/325

Madpack/ic func schema

IC/DC was prone to failure if the user were creating tables in the
madlib schema. This commit fixes the potential issue by removing the
madlib from the search path and adding the madlib_schema keyword for
every function, type and operator that is created by madlib.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib madpack/ic-func-schema

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/325.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #325


commit c762817926abc36191cd27d77c4a2ed7b2ec8151
Author: Orhan Kislal 
Date:   2018-09-27T13:30:41Z

IC/DC: Remove madlib schema IC/DC

IC/DC was prone to failure if the user were creating tables in the
madlib schema. This commit fixes the potential issue by removing the
madlib from the search path and adding the madlib_schema keyword for
every function, type and operator that is created by madlib.

commit dd1389639232dce64e359cef923941103e37f3a6
Author: Orhan Kislal 
Date:   2018-09-27T13:56:36Z

Fix double schema errors

commit 325d70c1f9d24b7abb390270d2d2986e86cabba4
Author: Orhan Kislal 
Date:   2018-09-27T16:18:18Z

Revert documentation changes




---


[GitHub] madlib pull request #324: Madpack/ic func schema

2018-09-27 Thread orhankislal
Github user orhankislal closed the pull request at:

https://github.com/apache/madlib/pull/324


---


[GitHub] madlib pull request #324: Madpack/ic func schema

2018-09-27 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/324

Madpack/ic func schema

IC/DC was prone to failure if the user were creating tables in the
madlib schema. This commit fixes the potential issue by removing the
madlib from the search path and adding the madlib_schema keyword for
every function, type and operator that is created by madlib.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib madpack/ic-func-schema

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/324.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #324


commit c762817926abc36191cd27d77c4a2ed7b2ec8151
Author: Orhan Kislal 
Date:   2018-09-27T13:30:41Z

IC/DC: Remove madlib schema IC/DC

IC/DC was prone to failure if the user were creating tables in the
madlib schema. This commit fixes the potential issue by removing the
madlib from the search path and adding the madlib_schema keyword for
every function, type and operator that is created by madlib.

commit dd1389639232dce64e359cef923941103e37f3a6
Author: Orhan Kislal 
Date:   2018-09-27T13:56:36Z

Fix double schema errors




---


[GitHub] madlib pull request #322: Madpack devcheck schema

2018-09-27 Thread orhankislal
Github user orhankislal closed the pull request at:

https://github.com/apache/madlib/pull/322


---


[GitHub] madlib pull request #322: Madpack devcheck schema

2018-09-24 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/322

Madpack devcheck schema



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib madpack-devcheck-schema

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/322.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #322


commit 33e4634b5d11b9ae29de3dece5b2ba96c7257aed
Author: Orhan Kislal 
Date:   2018-09-24T14:33:05Z

IC: Add schema for test cases

IC/DC was prone to failure if the user were creating tables in the
madlib schema. This commit fixes the potential issue by adding the
madlib_test_schema in the madpack and test cases.

commit 22c3e31560aa50601ea185bdc3da613efd6d40b7
Author: Orhan Kislal 
Date:   2018-09-24T14:34:33Z

Margins: Copy summary table instead of renaming

JIRA: MADLIB-1274

Margins summary table gets dropped since its schema remains pg_temp.
This commit fixed the issue by copying the contents instead of renaming.




---


[GitHub] madlib issue #321: RF: Increase the dataset size of dev-check test

2018-09-21 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/321
  
I hope so. I tried over 3000 runs with this fix in and did not get a single 
error.


---


[GitHub] madlib pull request #321: RF: Increase the dataset size of dev-check test

2018-09-21 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/321

RF: Increase the dataset size of dev-check test



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib rf-devc-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/321.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #321


commit d7922ef9ba06fe2b25868f261c365307b9d97141
Author: Orhan Kislal 
Date:   2018-09-21T13:02:19Z

RF: Increase the dataset size of dev-check test




---


[GitHub] madlib pull request #318: Madpack: Add a script for automating changelist cr...

2018-09-16 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/318#discussion_r217953051
  
--- Diff: src/madpack/create_changelist.py ---
@@ -0,0 +1,239 @@
+#!/usr/bin/python
+# 
--
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# 
--
+
+# Create changelist for any two branches/tags
+
+# Prequisites:
+# The old version has to be installed in the "madlib_old_vers" schema
+# The new version has to be installed in the "madlib" (default) schema
+# Two branches/tags must exist locally (run 'git fetch' to ensure you have 
the latest version)
+# The current branch does not matter
+
+# Usage (must be executed in the src/madpack directory):
+# python create_changelist.py
+# If you are using the master branch, plase make sure to edit the 
branch/tag in the output file
+
+# Example (should be equivalent to changelist_1.13_1.14.yaml):
+# python create_changelist.py madlib rel/v1.13 rel/v1.14 chtest1.yaml
+
+import sys
+import os
+
+database = sys.argv[1]
+old_vers = sys.argv[2]
+new_vers = sys.argv[3]
+ch_filename = sys.argv[4]
+
+if os.path.exists(ch_filename):
+print "{0} already exists".format(ch_filename)
+raise SystemExit
+
+err1 = os.system("""psql {0} -l > /dev/null""".format(database))
+if err1 != 0:
+print "Database {0} does not exist".format(database)
+raise SystemExit
+
+err1 = os.system("""psql {0} -c "select madlib_old_vers.version()" > 
/dev/null
+ """.format(database))
+if err1 != 0:
+print "MADlib is not installed in the madlib_old_vers schema. Please 
refer to the Prequisites."
+raise SystemExit
+
+err1 = os.system("""psql {0} -c "select madlib.version()" > /dev/null
+ """.format(database))
+if err1 != 0:
+print "MADlib is not installed in the madlib schema. Please refer to 
the Prequisites."
+raise SystemExit
+
+print "Creating changelist {0}".format(ch_filename)
+os.system("rm -f /tmp/madlib_tmp_nm.txt /tmp/madlib_tmp_udf.txt 
/tmp/madlib_tmp_udt.txt")
+try:
+# Find the new modules using the git diff
+err1 = os.system("git diff {old_vers} {new_vers} --name-only 
--diff-filter=A > /tmp/madlib_tmp_nm.txt".format(**locals()))
+if err1 != 0:
+print "Git diff failed. Please ensure that branches/tags are 
fetched."
+raise SystemExit
+
+f = open("/tmp/madlib_tmp_cl.yaml", "w")
+f.write(
+"""# 
--
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# 
--
+""")
+
+f.write(
 

[GitHub] madlib issue #318: Madpack: Add a script for automating changelist creation

2018-09-14 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/318
  
@kaknikhil I checked the 1.11 -> 1.12 scenario. The tool is missing the 
`tree_train` and `forest_train` entries. The change seems to be the removal 
of `surrogate_params` and the addition of `null_handling_params`. Since both of 
them are of type `text`, the change does not get picked up by the `diff_udf.sql` 
script. 
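
To illustrate why a comparison on argument types alone misses this, here is a toy version in Python. The argument lists are abbreviated and hypothetical, and the function is not part of `diff_udf.sql`.

```python
def changed_functions(old, new, include_names):
    """old/new map function name -> list of (arg_name, arg_type) pairs."""
    def signature(args):
        # Compare full (name, type) pairs, or only the types.
        return [a if include_names else a[1] for a in args]
    return sorted(f for f in old
                  if f in new and signature(old[f]) != signature(new[f]))


old = {"tree_train": [("max_depth", "integer"), ("surrogate_params", "text")]}
new = {"tree_train": [("max_depth", "integer"), ("null_handling_params", "text")]}
```

Here `changed_functions(old, new, include_names=False)` reports nothing, while `include_names=True` reports `tree_train`.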


---


[GitHub] madlib issue #318: Madpack: Add a script for automating changelist creation

2018-09-13 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/318
  
Thanks for the review @kaknikhil .

1. We had an issue with the knn help messages during one of the releases 
because they didn't show up on the `diff_udf.sql` output. IIRC, the problem was 
that the same function was converted from a pure SQL function to a plpython 
function. I don't have a quick solution to identify such cases.
2. This indentation should be OK.
3. We can check if the `new_vers` tag/branch exists and use `master` in its 
place. Alternatively, we can change the comment to avoid using the new version. 
It does not have any functional value.
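
For item 3, the check could look roughly like the sketch below. The helper names are hypothetical and this is not in the PR; it relies on `git rev-parse --verify --quiet`, which exits with a non-zero status when the ref does not exist.

```python
import subprocess


def git_ref_exists(ref):
    """True if ref names a commit in the current repository."""
    result = subprocess.run(
        ["git", "rev-parse", "--verify", "--quiet", ref + "^{commit}"],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def ref_for_changelist(new_vers, exists=git_ref_exists, fallback="master"):
    """Use new_vers if it exists locally, otherwise fall back to master."""
    return new_vers if exists(new_vers) else fallback
```

The `exists` parameter is there only so the fallback logic can be exercised without a repository.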


---


[GitHub] madlib pull request #318: Madpack: Add a script for automating changelist cr...

2018-09-13 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/318#discussion_r217361584
  
--- Diff: src/madpack/create_changelist.py ---
@@ -0,0 +1,229 @@
+#!/usr/bin/python
+# 
--
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# 
--
+
+# Create changelist for any two branches/tags
+
+# Prequisites:
+# The old version has to be installed in the "madlib_old_vers" schema
+# The new version has to be installed in the "madlib" (default) schema
+# Two branches/tags must exist locally (run 'git fetch' to ensure you have 
the latest version)
+# The current branch does not matter
+
+# Usage (must be executed in the src/madpack directory):
+# python create_changelist.py
+
+# Example (should be equivalent to changelist_1.13_1.14.yaml):
+# python create_changelist.py madlib rel/v1.13 rel/v1.14 chtest1.yaml
+
+import sys
+import os
+
+database = sys.argv[1]
+old_vers = sys.argv[2]
+new_vers = sys.argv[3]
+ch_filename = sys.argv[4]
+
+if os.path.exists(ch_filename):
+    print "{0} already exists".format(ch_filename)
+    raise SystemExit
+
+err1 = os.system("""psql {0} -l > /dev/null""".format(database))
+if err1 != 0:
+    print "Database {0} does not exist".format(database)
+    raise SystemExit
+
+err1 = os.system("""psql {0} -c "select madlib_old_vers.version()" > /dev/null
+                 """.format(database))
+if err1 != 0:
+    print "Schema {0} does not exist".format(old_vers)
+    raise SystemExit
+
+err1 = os.system("""psql {0} -c "select madlib.version()" > /dev/null
+                 """.format(database))
+if err1 != 0:
+    print "Schema {0} does not exist".format(new_vers)
+    raise SystemExit
+
--- End diff --

That would be tricky for branches. `madlib.version()` gives a  on a branch (and just a tag on tags). We can use `git 
rev-parse --short HEAD` to get the commit hash, but that seems to complicate 
the code just to hand-hold the developer who uses it.
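For reference, a small sketch of what reading the commit hash would look like. `short_head` is a hypothetical helper, not something in `create_changelist.py`, and the `run` parameter exists only so the function can be tried without a git checkout.

```python
import subprocess


def short_head(run=subprocess.check_output):
    """Return the abbreviated hash of the commit that is checked out."""
    # `git rev-parse --short HEAD` prints the short hash followed by a newline.
    output = run(["git", "rev-parse", "--short", "HEAD"])
    return output.decode("utf-8").strip()
```

Comparing this value against the installed schema would still require parsing the `madlib.version()` string, which is the extra complexity mentioned above.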


---


[GitHub] madlib pull request #318: Madpack: Add a script for automating changelist cr...

2018-09-13 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/318#discussion_r217358090
  
--- Diff: src/madpack/diff_udf.sql ---
@@ -11,71 +11,13 @@ RETURNS text AS $$
 $$ LANGUAGE plpythonu;
 
 
-CREATE OR REPLACE FUNCTION get_functions(schema_name text)
+CREATE OR REPLACE FUNCTION get_functions(table_name text, schema_name text,
+ type_filter text)
 RETURNS VOID AS
 $$
 import plpy
 plpy.execute("""
-CREATE TABLE functions_madlib_new_version AS
-SELECT
-"schema", "name", filter_schema("retype", 'madlib') retype,
-filter_schema("argtypes", 'madlib') argtypes, "type"
-FROM
-(
-
-SELECT n.nspname as "schema",
--- End diff --

It was duplicated code. I just added another parameter to the function and 
reused it.


---


[GitHub] madlib issue #318: Madpack: Add a script for automating changelist creation

2018-09-11 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/318
  
Thanks for the comments @kaknikhil. I addressed them and added the support 
for return type based dependencies. It would be great if you could take another 
look at the latest version. 


---


[GitHub] madlib pull request #318: Madpack: Add a script for automating changelist cr...

2018-09-11 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/318#discussion_r216567704
  
--- Diff: src/madpack/diff_udf.sql ---
@@ -142,9 +142,12 @@ DROP TABLE IF EXISTS functions_madlib_new_version;
 SELECT get_functions('madlib_old_vers');
 
 SELECT
+type,
 --'\t-' || name || ':' || '\n\t\t-rettype: ' || retype || 
'\n\t\t-argument: ' || argtypes
-'- ' || name || ':' || '\nrettype: ' || retype || '\n  
  argument: ' || argtypes AS "Dropped UDFs"
-, type
+'- ' || name || ':' AS "Dropped UDF part1",
--- End diff --

`rettype` was not removed; I just changed the column names to the `Dropped UDF 
part1` format so that I can easily parse them. The newline characters were just 
complicating the output for no reason.


---


[GitHub] madlib pull request #318: Madpack: Add a script for automating changelist cr...

2018-09-11 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/318#discussion_r216566555
  
--- Diff: src/madpack/create_changelist.py ---
@@ -0,0 +1,132 @@
+#!/usr/bin/python
+# 
--
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# 
--
+
+# Create changelist for any two branches
+
+# Prerequisites:
+# The old version has to be installed in the "madlib_old_vers" schema
+# The new version has to be installed in the "madlib" (default) schema
+# Two branches must exist locally (run 'git fetch' to ensure you have the latest version)
--- End diff --

They can be branches as well as tags.


---


[GitHub] madlib pull request #318: Madpack: Add a script for automating changelist cr...

2018-09-04 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/318

Madpack: Add a script for automating changelist creation



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib madpack/auto-changelist

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/318.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #318


commit 28528b21ecb9b53d03de7683ff7e8db2bb409675
Author: Orhan Kislal 
Date:   2018-09-04T12:50:50Z

Madpack: Add a script for automating changelist creation




---


[GitHub] madlib pull request #311: Vector to columns: added support for splitting arr...

2018-08-16 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/311#discussion_r210687181
  
--- Diff: 
src/ports/postgres/modules/utilities/test/unit_tests/test_transform_vec_cols.py_in
 ---
@@ -125,23 +125,24 @@ class Vec2ColsTestSuite(unittest.TestCase):
 
 def test_get_names_for_split_output_cols_feature_names_none(self):
 self.plpy_mock_execute.return_value = [{"n_x": 3}]
-new_cols = 
self.subject.get_names_for_split_output_cols(self.default_source_table, 
'foobar', None)
+new_cols = 
self.subject.get_names_for_split_output_cols(self.default_source_table, 
'foobar')
 self.assertEqual(['f1', 'f2', 'f3'], new_cols)
 
-def test_get_names_for_split_output_cols_feature_names_not_none(self):
-self.plpy_mock_execute.return_value = [{"n_x": 3}]
-new_cols = 
self.subject.get_names_for_split_output_cols(self.default_source_table, 
'foobar', ['a', 'b', 'c'])
-self.assertEqual(['a', 'b', 'c'], new_cols)
+# def 
test_get_names_for_split_output_cols_feature_names_not_none(self):
--- End diff --

We should remove these commented lines.


---


[GitHub] madlib pull request #302: Remove online examples

2018-08-15 Thread orhankislal
Github user orhankislal closed the pull request at:

https://github.com/apache/madlib/pull/302


---


[GitHub] madlib pull request #306: Ubuntu support

2018-08-06 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/306#discussion_r208064138
  
--- Diff: deploy/DEB/postinst ---
@@ -0,0 +1,46 @@
+#!/bin/sh
+#
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Source debconf library.
+. /usr/share/debconf/confmodule
+
+MADLIB_VERSION="1.15-dev"
+MADLIB_INSTALL_PATH="InstallPathNotFound"
+
+# Fetching configuration from debconf
+db_get madlib/installpath
+MADLIB_INSTALL_PATH=$RET
+
+# Remove existing soft links
--- End diff --

This is not as trivial as it seems. CPackDeb takes a few extra 
files (`postinst`, `postrm`, etc.) but not any arbitrary file. We would have to 
put the common file into a different folder such as `/usr/share/madlib`, since 
we have to access this file after madlib is removed.


---


[GitHub] madlib issue #306: Ubuntu support

2018-08-06 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/306
  
Thanks for the comments, Rahul. I am currently testing the new changes, so 
there is no need to test these temporary commits.


---


[GitHub] madlib pull request #306: Ubuntu support

2018-08-02 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/306

Ubuntu support



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib ubuntu-support

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/306.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #306


commit be33d4df22e13a80f5d4824d44d0b5d45bd3e892
Author: Nandish Jayaram 
Date:   2018-07-25T00:27:59Z

Ubuntu support for MADlib

JIRA: MADLIB-1256

Adds support for compiling on Ubuntu as well as creating a deb package.

Co-authored-by: Domino Valdano 
Co-authored-by: Jingyi Mei
Co-authored-by: Orhan Kislal 

commit 1b9deaf88c82e98a3549adb5eac3ce93bdc485fd
Author: Orhan Kislal 
Date:   2018-08-01T23:47:57Z

rename cmakelists

commit c6045c63c155fca153f2d245654c1bf2dbb66676
Author: Orhan Kislal 
Date:   2018-08-02T17:36:43Z

Fix licenses




---


[GitHub] madlib pull request #305: Change the version to 1.15 and add changelist

2018-08-02 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/305

Change the version to 1.15 and add changelist

Co-authored-by: Nandish Jayaram 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib 1-15-changelist

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/305.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #305


commit 588b1268ed61f6f06181c58f6de9902e0d95d291
Author: Orhan Kislal 
Date:   2018-08-01T21:49:46Z

Change the version to 1.15 and add changelist

Co-authored-by: Nandish Jayaram 




---


[GitHub] madlib pull request #304: MLP: Add test for iterations_per_step

2018-08-01 Thread orhankislal
Github user orhankislal closed the pull request at:

https://github.com/apache/madlib/pull/304


---


[GitHub] madlib pull request #304: MLP: Add test for iterations_per_step

2018-07-31 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/304

MLP: Add test for iterations_per_step



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib bugfix/mlp-grouping-ans

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/304.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #304


commit b87199d540084da3d72e62e410a34e67b8159b45
Author: Orhan Kislal 
Date:   2018-07-30T22:14:22Z

MLP: Add test for iterations_per_step




---


[GitHub] madlib pull request #303: Add test for iterations_per_step

2018-07-31 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/303

Add test for iterations_per_step



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib bugfix/mlp-grouping-ans

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/303.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #303


commit 27fa25ae6ca4c8cc167c475c2f368c53fafabd20
Author: Orhan Kislal 
Date:   2018-07-30T22:14:22Z

Add test for iterations_per_step




---


[GitHub] madlib pull request #303: Add test for iterations_per_step

2018-07-31 Thread orhankislal
Github user orhankislal closed the pull request at:

https://github.com/apache/madlib/pull/303


---


[GitHub] madlib pull request #302: Remove online examples

2018-07-30 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/302

Remove online examples

JIRA: MADLIB-1260

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib remove-online-examples

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/302.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #302


commit 7a857db9afbd32bbb3df8747742a754bf8998ab8
Author: Nandish Jayaram 
Date:   2018-07-27T23:18:33Z

Remove examples from online docs for subset of modules

commit 5caf22418e3ac273c0466597a5c8359d6f5ceaec
Author: Orhan Kislal 
Date:   2018-07-28T00:08:04Z

Remove on-line examples




---


[GitHub] madlib pull request #291: Feature: Vector-Column Transformations

2018-07-27 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/291#discussion_r205914673
  
--- Diff: src/ports/postgres/modules/utilities/transform_vec_cols.py_in ---
@@ -0,0 +1,495 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import plpy
+from control import MinWarning
+from internal.db_utils import is_col_1d_array
+from internal.db_utils import quote_literal
+from utilities import _assert
+from utilities import add_postfix
+from utilities import ANY_ARRAY
+from utilities import is_valid_psql_type
+from utilities import py_list_to_sql_string
+from utilities import split_quoted_delimited_str
+from validate_args import is_var_valid
+from validate_args import explicit_bool_to_text
+from validate_args import get_cols
+from validate_args import get_cols_and_types
+from validate_args import get_expr_type
+from validate_args import input_tbl_valid
+from validate_args import output_tbl_valid
+from validate_args import table_exists
+
+class vec_cols_helper:
+def __init__(self):
+self.all_cols = None
+
+def get_cols_as_list(self, cols_to_process, source_table=None, 
exclude_cols=None):
+"""
+Get a list of columns based on the value of cols_to_process
+Args:
+@param cols_to_process: str, Either a * or a comma-separated 
list of col names
+@param source_table: str, optional. Source table name
+@param exclude_cols: str, optional. Comma-separated list of 
the col(s) to exclude
+ from the source table, only used if 
cols_to_process is *
+Returns:
+A list of column names (or an empty list)
+"""
+# If cols_to_process is empty/None, return empty list
+if not cols_to_process:
+return []
+if cols_to_process.strip() != "*":
+# If cols_to_process is a comma separated list of names, 
return list
+# of column names in cols_to_process.
+return [col for col in 
split_quoted_delimited_str(cols_to_process)
+if col not in split_quoted_delimited_str(exclude_cols)]
+if source_table:
+if not self.all_cols:
+self.all_cols = get_cols(source_table)
+return [col for col in self.all_cols
+if col not in split_quoted_delimited_str(exclude_cols)]
+return []
+
+class vec2cols:
+def __init__(self):
+self.get_cols_helper = vec_cols_helper()
+self.module_name = self.__class__.__name__
+
+def validate_args(self, source_table, output_table, vector_col, 
feature_names,
+  cols_to_output):
+"""
+Validate args for vec2cols
+"""
+input_tbl_valid(source_table, self.module_name)
+output_tbl_valid(output_table, self.module_name)
+is_var_valid(source_table, cols_to_output)
+is_var_valid(source_table, vector_col)
+_assert(is_valid_psql_type(get_expr_type(vector_col, 
source_table), ANY_ARRAY),
+"{0}: vector_col should refer to an 
array.".format(self.module_name))
+_assert(is_col_1d_array(source_table, vector_col),
+"{0}: vector_col must be a 1-dimensional 
array.".format(self.module_name))
+
+def get_names_for_split_output_cols(self, source_table, vector_col, 
feature_names):
+"""
+Get list of names for the newly-split columns to include in the
+output table.
+Args:
+@param: source_table, str. Source table
+@param: vector_col, str. Column name containing the array input
+@param: feature_names, list. Python list of the feat

[GitHub] madlib pull request #291: Feature: Vector-Column Transformations

2018-07-27 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/291#discussion_r205914522
  
--- Diff: src/ports/postgres/modules/utilities/validate_args.py_in ---
@@ -513,11 +513,12 @@ def array_col_has_same_dimension(tbl, col):
 # 
 
 
-def explicit_bool_to_text(tbl, cols, schema_madlib):
+def explicit_bool_to_text(tbl, cols, schema_madlib, is_forced=False):
--- End diff --

w/ @ArvindSridhar On platforms that have bool-to-text casting (gpdb5, pg 
9.6, pg 10), we still need this to make sure we can create an array of bool and 
text types.
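To illustrate why the forced cast matters: PostgreSQL will not build an array from a mix of `boolean` and `text` elements, so each boolean column has to be cast explicitly before it goes into `ARRAY[...]`. The sketch below is a hypothetical helper, not the actual body of `explicit_bool_to_text`, and it takes the column types as plain input instead of looking them up.

```python
def array_expr_with_text_bools(cols_and_types):
    """Build an ARRAY[...] expression, casting boolean columns to TEXT.

    cols_and_types: list of (column_name, postgres_type_name) pairs.
    """
    parts = []
    for name, pg_type in cols_and_types:
        if pg_type.lower() in ("boolean", "bool"):
            # Explicit cast so the element type matches the text columns.
            parts.append("({0})::TEXT".format(name))
        else:
            parts.append(name)
    return "ARRAY[{0}]".format(", ".join(parts))
```

With a boolean column `flag` and a text column `name`, this produces `ARRAY[(flag)::TEXT, name]`.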


---


[GitHub] madlib pull request #:

2018-07-25 Thread orhankislal
Github user orhankislal commented on the pull request:


https://github.com/apache/madlib/commit/1fe308c70d2c91fef508d29d81ed0e93da429eb6#commitcomment-29832298
  
In src/madpack/madpack.py on line 973:
Let's say the user has 1.13 installed on a schema, uses rpm to get 1.14 but 
doesn't run the `madpack upgrade` command and tries the install-check. That 
case is caught here, and `Versions do not match` is accurate. Checking for a 
schema that does not exist is a separate `if` check. I'll make a new 
commit to catch it.
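A rough sketch of the two checks kept separate, as described above. The function name and the `Schema does not exist` message are hypothetical; only `Versions do not match` comes from the existing code path.

```python
def install_check_error(schema_exists, db_version, package_version):
    """Return an error message for install-check, or None if it can proceed."""
    if not schema_exists:
        # Separate check: nothing is installed in the target schema at all.
        return "Schema does not exist"
    if db_version != package_version:
        # E.g. 1.13 in the database, 1.14 from the rpm, no `madpack upgrade`.
        return "Versions do not match"
    return None
```

For the scenario in the comment, `install_check_error(True, "1.13", "1.14")` reports the version mismatch, while a missing schema is reported before any version comparison happens.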


---


[GitHub] madlib pull request #297: Madpack: Fix various schema related bugs

2018-07-20 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/297

Madpack: Fix various schema related bugs



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib bugfix/madpack-extra

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/297.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #297


commit 31bfd8d3913576472dcb65feca5cb2dda01fa458
Author: Orhan Kislal 
Date:   2018-07-20T22:36:11Z

Madpack: Fix various schema related bugs




---


[GitHub] madlib pull request #289: RF: Add impurity variable importance

2018-07-09 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/289

RF: Add impurity variable importance

JIRA: MADLIB-1205

This commit makes the following changes:
- Add impurity variable importance for random forests.
- Rename current cat_var_importance and con_var_importance measurements to
oob_cat_var_importance and oob_con_var_importance.

New impurity measurement is provided as impurity_var_importance, and 
supports
grouping. It combines the importance values for both categorical and
continuous features into a single array.

Co-authored-by: Rahul Iyer 
Co-authored-by: Jingyi Mei 
Co-authored-by: Arvind Sridhar 
Co-authored-by: Nandish Jayaram 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib rf_gini_importance

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/289.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #289


commit 622d46a85f4264fdc94bd41dc66a23f1aa2c3ed6
Author: Rahul Iyer 
Date:   2018-07-10T00:34:33Z

RF: Add impurity variable importance

JIRA: MADLIB-1205

This commit makes the following changes:
- Add impurity variable importance for random forests.
- Rename current cat_var_importance and con_var_importance measurements to
oob_cat_var_importance and oob_con_var_importance.

New impurity measurement is provided as impurity_var_importance, and 
supports
grouping. It combines the importance values for both categorical and
continuous features into a single array.

Co-authored-by: Rahul Iyer 
Co-authored-by: Jingyi Mei 
Co-authored-by: Arvind Sridhar 
Co-authored-by: Nandish Jayaram 




---


[GitHub] madlib pull request #286: Build: Remove symlinks during rpm uninstall

2018-07-06 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/286#discussion_r200793992
  
--- Diff: pom.xml ---
@@ -66,6 +66,7 @@
   deploy/preflight.sh
   deploy/RPM/CMakeLists.txt
   deploy/rpm_post.sh
+  deploy/rpm_post_uninstall.sh
--- End diff --

Since this is a new file, we should add the apache license instead of 
adding it to the exclude list.


---


[GitHub] madlib pull request #276: Feature/dev check

2018-06-26 Thread orhankislal
Github user orhankislal closed the pull request at:

https://github.com/apache/madlib/pull/276


---


[GitHub] madlib pull request #276: Feature/dev check

2018-06-07 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/276

Feature/dev check



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib feature/dev-check

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/276.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #276


commit 50a35b2e5799a5ddf6f4d23a7bca4566dbd1d05b
Author: Nandish Jayaram 
Date:   2018-06-06T23:42:56Z

Madpack: Add dev-check and a compact install-check.

- The current install check is expensive since it runs various hyper-param
permutations for all MADlib modules. This commit moves all of those
tests to dev-check, which developers can use to iterate faster. We have
now created a watered-down install-check for each module, which runs
just one hyper-param combination for each MADlib function and does not
do any asserts.
- This commit also includes changes in madpack to add a new madpack
  option for dev-check.

TODO:
- complete trimming install check for all modules.
- update documentation for developer consumption.

Co-authored-by: Arvind Sridhar 

commit d6c7834f73d00d0bd3ddf84af524ba08725a6244
Author: Arvind Sridhar 
Date:   2018-06-07T21:51:47Z

Install-Check: Add new IC files for the lightweight testing

Co-authored-by: Orhan Kislal 




---


[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-24 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/271

Madpack: Make install, reinstall and upgrade atomic

We now write all the necessary sql into one file, and run it once in a
single session. The database's rollback will be useful to bring it back
to original state in case of a failure.

Co-authored-by: Rahul Iyer 
Co-authored-by: Orhan Kislal 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib feature/atomic_upgrade

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/271.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #271


commit 693f6ae59dbb65d56fe4ff95bac8fde198f3d04f
Author: Nandish Jayaram 
Date:   2018-05-18T23:13:28Z

Madpack: Make install, reinstall and upgrade atomic

We now write all the necessary sql into one file, and run it once in a
single session. The database's rollback will be useful to bring it back
to original state in case of a failure.

Co-authored-by: Rahul Iyer 
Co-authored-by: Orhan Kislal 
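
The single-file, single-session approach described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not madpack's implementation: the helper names are hypothetical, and whether madpack passes these particular `psql` options is not stated here. `--single-transaction` and `ON_ERROR_STOP` are standard `psql` options that make the whole file roll back on the first error.

```python
def write_combined_sql(statements, path):
    """Write every statement into one file, each terminated by a semicolon."""
    with open(path, "w") as handle:
        for statement in statements:
            handle.write(statement.strip().rstrip(";") + ";\n")


def build_psql_command(database, sql_file):
    """Build a psql invocation that runs the file as a single transaction."""
    return [
        "psql", "-d", database,
        "--single-transaction",    # wrap the file in BEGIN/COMMIT
        "-v", "ON_ERROR_STOP=1",   # abort (and roll back) on the first error
        "-f", sql_file,
    ]
```

Running the resulting command with `subprocess.call` would then either apply every statement or leave the database as it was.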




---


[GitHub] madlib pull request #269: Statistics: Add grouping support for correlation f...

2018-05-16 Thread orhankislal
Github user orhankislal closed the pull request at:

https://github.com/apache/madlib/pull/269


---


[GitHub] madlib issue #269: Statistics: Add grouping support for correlation function...

2018-05-16 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/269
  
Thanks for your comments Frank. Regarding the accuracy, I have tested the 
code with multiple groups from multiple grouping columns with hand-calculated 
values (as seen in the install-check). I have also tried a larger dataset (600 
columns). I have duplicated the data to create multiple columns and checked to 
see if the results match across groups. Please let us know if you have any 
other suggestions.
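The duplication check mentioned above can be sketched in plain Python. This is only an illustration of the idea with made-up numbers, not the install-check itself: the same rows are copied into two groups, and the per-group Pearson correlation should come out identical.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def grouped_correlation(rows):
    """rows: iterable of (group, x, y). Returns {group: correlation}."""
    groups = defaultdict(lambda: ([], []))
    for group, x, y in rows:
        groups[group][0].append(x)
        groups[group][1].append(y)
    return {g: pearson(xs, ys) for g, (xs, ys) in groups.items()}


# Duplicate the same data into two groups; the results should match.
base = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.0), (4.0, 9.0)]
rows = [(g, x, y) for g in ("a", "b") for x, y in base]
result = grouped_correlation(rows)
```

Any difference between `result["a"]` and `result["b"]` would point to rows leaking across groups.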


---


[GitHub] madlib pull request #269: Statistics: Add grouping support for correlation f...

2018-05-07 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/269

Statistics: Add grouping support for correlation functions

JIRA: MADLIB-1128

This commit adds grouping support to correlation and covariance
functions in MADlib stats. Changes include relevant queries to do the
same.
This commit also has refactor changes to a helper function in
utilities.py_in.

Co-authored-by: Jingyi Mei 
Co-authored-by: Nikhil Kak 
Co-authored-by: Nandish Jayaram 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib feature/correlation-grouping

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/269.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #269


commit e9dd9ae88d9d2acaea68c093396f6d600148ede4
Author: Orhan Kislal 
Date:   2018-05-07T23:53:51Z

Statistics: Add grouping support for correlation functions

JIRA: MADLIB-1128

This commit adds grouping support to correlation and covariance
functions in MADlib stats. Changes include relevant queries to do the
same.
This commit also has refactor changes to a helper function in
utilities.py_in.

Co-authored-by: Jingyi Mei 
Co-authored-by: Nikhil Kak 
Co-authored-by: Nandish Jayaram 




---


[GitHub] madlib pull request #266: Release 1.14: Version numbering and upgrade relate...

2018-04-23 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/266#discussion_r183492733
  
--- Diff: src/madpack/changelist_1.13_1.14.yaml ---
@@ -0,0 +1,97 @@
+# 
--
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# 
--
+
+# Changelist for MADlib version 1.13 to 1.14
+
+# This file contains all changes that were introduced in a new version of
+# MADlib. This changelist is used by the upgrade script to detect what 
objects
+# should be upgraded (while retaining all other objects from the previous 
version)
+
+# New modules (actually .sql_in files) added in upgrade version
+# For these files the sql_in code is retained as is with the functions in 
the
+# file installed on the upgrade version. All other files (that don't have
+# updates), are cleaned up to remove object replacements
+new module:
+# - Changes from 1.13 to 1.14 
--- End diff --

Added the missing modules.


---


[GitHub] madlib issue #266: Release 1.14: Version numbering and upgrade related chang...

2018-04-20 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/266
  
Thanks for the comments Rahul.


---


[GitHub] madlib pull request #266: Release 1.14: Version numbering and upgrade relate...

2018-04-20 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/266#discussion_r183180223
  
--- Diff: src/madpack/upgrade_util.py ---
@@ -82,16 +82,21 @@ def _get_function_info(self, oid):
 proname,
 textin(regtypeout(prorettype::regtype)) AS rettype,
 CASE array_upper(proargtypes,1) WHEN -1 THEN ''
-ELSE 
textin(regtypeout(unnest(proargtypes)::regtype))
+ELSE textin(regtypeout(foo))
 END AS argtype,
 CASE WHEN proargnames IS NULL THEN ''
-ELSE unnest(proargnames)
+ELSE bar
 END AS argname,
 CASE array_upper(proargtypes,1) WHEN -1 THEN 1
-ELSE generate_series(0, array_upper(proargtypes, 
1))
+ELSE zee
 END AS i
 FROM
-pg_proc AS p
+(SELECT *, oid,
+unnest(proargtypes)::regtype AS foo,
--- End diff --

We changed the query because PG 10 no longer allows set-returning functions 
in the `CASE` clause.


---


[GitHub] madlib pull request #266: Release 1.14: Version numbering and upgrade relate...

2018-04-19 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/266

Release 1.14: Version numbering and upgrade related changes

Updates the version number to 1.14 for the release candidate.
Updates the changelists and other related files for upgrade.
Note that upgrade is not supported from versions prior to 1.11.

Co-authored-by: Nikhil Kak 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib rel/upgrade_v114

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/266.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #266


commit c65351f29d7944b82e189489dad74128f8afe69f
Author: Orhan Kislal 
Date:   2018-04-19T17:23:47Z

Release 1.14: Version numbering and upgrade related changes

Updates the version number to 1.14 for the release candidate.
Updates the changelists and other related files for upgrade.
Note that upgrade is not supported from versions prior to 1.11.

Co-authored-by: Nikhil Kak 




---


[GitHub] madlib pull request #261: MLP: Check for 1-hot encoding of dependent variabl...

2018-04-11 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/261

MLP: Check for 1-hot encoding of dependent variable for minibatch

This commit adds a check to make sure that the dependent variable for MLP
minibatch is one-hot encoded. The check is shallow: it only validates that the
dependent variable array has more than one value.

Co-authored-by: Orhan Kislal 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib feature/mlp-encoded-dep-minibatch

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/261.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #261


commit b101914cea0688f2969dd6bb2823cd340b02b243
Author: Nikhil Kak 
Date:   2018-04-10T23:40:49Z

MLP: Check for 1-hot encoding of dependent variable for minibatch

This commit adds a check to make sure that the dependent variable for MLP
minibatch is one-hot encoded. The check is shallow: it only validates that the
dependent variable array has more than one value.

Co-authored-by: Orhan Kislal 
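
The commit message above describes a shallow validation. A minimal Python sketch of the difference between that check and a strict one-hot test follows. Both function names are hypothetical, and the stricter check is an assumption added for contrast, not something the commit implements.

```python
def has_multiple_values(dep_var):
    """Shallow check from the commit message: an array with more than 1 value."""
    return isinstance(dep_var, (list, tuple)) and len(dep_var) > 1


def is_one_hot(dep_var):
    """Stricter check (assumed, not in the commit): exactly one 1, zeros elsewhere."""
    return (has_multiple_values(dep_var)
            and all(v in (0, 1) for v in dep_var)
            and sum(dep_var) == 1)
```

An array such as `[2, 3]` passes the shallow check but is not one-hot encoded, which is the gap the commit message acknowledges.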




---


[GitHub] madlib pull request #229: SVM: Add minibatch as a new solver

2018-01-24 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/229#discussion_r163690557
  
--- Diff: src/modules/convex/linear_svm_igd.cpp ---
@@ -120,6 +124,98 @@ linear_svm_igd_transition::run(AnyType &args) {
 return state;
 }
 
+/**
+ * @brief Perform the linear support vector machine transition step
+ *
+ * Called for each tuple.
+ */
+AnyType
+linear_svm_igd_minibatch_transition::run(AnyType &args) {
+// The real state.
+// For the first tuple: args[0] is nothing more than a marker that
+// indicates that we should do some initial operations.
+// For other tuples: args[0] holds the computation state until last tuple
+SVMMinibatchState > state = args[0];
+
+// initialize the state if first tuple
+if (state.algo.numRows == 0) {
+
+LinearSVM::epsilon = args[9].getAs();;
+LinearSVM::is_svc = args[10].getAs();;
+if (!args[3].isNull()) {
+SVMMinibatchState > previousState = args[3];
+state.allocate(*this, previousState.task.nFeatures);
+state = previousState;
+} else {
+// configuration parameters
+uint32_t dimension = args[4].getAs();
+state.allocate(*this, dimension); // with zeros
+}
+// resetting in either case
+// state.reset();
+state.task.stepsize = args[5].getAs();
+const double lambda = args[6].getAs();
+const bool isL2 = args[7].getAs();
+const int nTuples = args[8].getAs();
+
+// The regularization operations called below (scaling and clipping)
+// need these class variables to be set.
+L1::n_tuples = nTuples;
+L2::n_tuples = nTuples;
+if (isL2)
+L2::lambda = lambda;
+else
+L1::lambda = lambda;
+}
+
+state.algo.nEpochs = args[12].getAs();
+state.algo.batchSize = args[13].getAs();
+
+// Skip the current record if args[1] (features) contains NULL values,
+// or args[2] is NULL
+try {
+args[1].getAs();
+} catch (const ArrayWithNullException &e) {
+return args[0];
+}
+if (args[2].isNull())
+return args[0];
+
+// tuple
+using madlib::dbal::eigen_integration::MappedColumnVector;
+
+MappedMatrix x(NULL);
+MappedColumnVector y(NULL);
+try {
+new (&x) MappedMatrix(args[1].getAs());
+new (&y) MappedColumnVector(args[2].getAs());
+} catch (const ArrayWithNullException &e) {
+return args[0];
+}
+SVMMiniBatchTuple tuple;
+tuple.indVar = trans(x);
+tuple.depVar = y;
+
+// each tuple can be weighted - this can be combination of the sample weight
+// and the class weight. Calling function is responsible for combining the two
+// into a single tuple weight. The default value for this parameter is 1, set
+// into the definition of "tuple".
+// The weight is used to increase the value of a particular tuple for the online
+// learning. The weight is not used for the loss computation.
+tuple.weight = args[11].getAs();
+
+
+// Now do the transition step
+// apply Minibatching with regularization
+L2::scaling(state.task.model, state.task.stepsize);
+LinearSVMIGDAlgoMiniBatch::transitionInMiniBatch(state, tuple);
+L1::clipping(state.task.model, state.task.stepsize);
+
--- End diff --

Should we leave a comment explaining why the mini-batch transition step does 
not call the loss and gradient algorithms the way the regular transition step 
does? On the other hand, I am not sure we want code comments that explain the 
absence of something. Maybe we can mention this implementation detail in the 
design docs instead?
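
For context on the technique under review, here is a textbook mini-batch sub-gradient step for a linear SVM with hinge loss and L2 regularization. It is a generic sketch in pure Python, not the code in linear_svm_igd.cpp. The function name, the `lam` parameter, and the averaging over the batch are assumptions made for illustration.

```python
def minibatch_svm_step(weights, batch, stepsize, lam):
    """One sub-gradient step on (lam / 2) * ||w||^2 + mean hinge loss.

    batch is a list of (features, label) pairs with label in {-1, +1}.
    Returns a new weight list; the input list is not modified.
    """
    # Gradient of the L2 penalty.
    grad = [lam * w for w in weights]
    for features, label in batch:
        margin = label * sum(w * x for w, x in zip(weights, features))
        # Only rows inside the margin contribute a hinge sub-gradient.
        if margin < 1.0:
            for j, x in enumerate(features):
                grad[j] -= label * x / len(batch)
    return [w - stepsize * g for w, g in zip(weights, grad)]
```

In this formulation the gradient is accumulated over the whole batch before the model is updated once, rather than being applied row by row.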


---


[GitHub] madlib pull request #229: SVM: Add minibatch as a new solver

2018-01-24 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/229#discussion_r163120094
  
--- Diff: src/modules/convex/linear_svm_igd.cpp ---
@@ -120,6 +124,100 @@ linear_svm_igd_transition::run(AnyType &args) {
 return state;
 }
 
+/**
+ * @brief Perform the linear support vector machine transition step
+ *
+ * Called for each tuple.
+ */
+AnyType
+linear_svm_igd_minibatch_transition::run(AnyType &args) {
+// The real state.
+// For the first tuple: args[0] is nothing more than a marker that
+// indicates that we should do some initial operations.
+// For other tuples: args[0] holds the computation state until last tuple
+SVMMinibatchState > state = args[0];
+
+// initialize the state if first tuple
+if (state.algo.numRows == 0) {
+
+LinearSVM::epsilon = args[9].getAs();;
+LinearSVM::is_svc = args[10].getAs();;
+if (!args[3].isNull()) {
+SVMMinibatchState > previousState = args[3];
+state.allocate(*this, previousState.task.nFeatures);
+state = previousState;
+} else {
+// configuration parameters
+uint32_t dimension = args[4].getAs();
+state.allocate(*this, dimension); // with zeros
+}
+// resetting in either case
+// state.reset();
--- End diff --

We should remove these lines if we don't need them.


---


[GitHub] madlib pull request #229: SVM: Add minibatch as a new solver

2018-01-24 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/229#discussion_r163689232
  
--- Diff: src/ports/postgres/modules/svm/svm.py_in ---
@@ -89,9 +113,9 @@ def _verify_table(source_table, model_table, dependent_varname,
 "('{dependent_varname}') for source_table "
 "({source_table})!".format(dependent_varname=dependent_varname,
 source_table=source_table))
-dep_type = get_expr_type(dependent_varname, source_table)
-if '[]' in dep_type:
-plpy.error("SVM error: dependent_varname cannot be of array type!")
+# dep_type = get_expr_type(dependent_varname, source_table)
--- End diff --

We should remove these lines if we don't need them.


---


[GitHub] madlib issue #228: Add centos 7 postgres 9.6/10 docker files for automated t...

2018-01-18 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/228
  
@njayaram2 Please review at your earliest convenience.


---


[GitHub] madlib issue #227: Add docker file for postgres 9.6 and 10

2018-01-18 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/227
  
Created a new pull request (#228). Closing this one.


---


[GitHub] madlib pull request #227: Add docker file for postgres 9.6 and 10

2018-01-18 Thread orhankislal
Github user orhankislal closed the pull request at:

https://github.com/apache/madlib/pull/227


---


[GitHub] madlib pull request #228: Add centos 7 postgres 9.6/10 docker files for auto...

2018-01-18 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/228

Add centos 7 postgres 9.6/10 docker files for automated testing.

Additional Author : Nikhil Kak 

Also added a readme to describe all the docker files.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/orhankislal/madlib docker-images

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/228.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #228


commit df69f2626e1c77e0f65e2bec76f9704dfc54e2bf
Author: Nikhil Kak and Orhan Kislal 
Date:   2018-01-18T22:42:22Z

Add centos 7 postgres 9.6/10 docker files for automated testing.

Additional Author : Nikhil Kak 

Also added a readme to describe all the docker files.




---


[GitHub] madlib pull request #227: Add docker file for postgres 9.6 and 10

2018-01-17 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/227

Add docker file for postgres 9.6 and 10

- install plpython support for postgres
- add dockerfile for postgres 10 centos 7
- add postgres bin dir to $PATH for both 9.6 and 10
- remove unnecessary files

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/orhankislal/madlib centos_postgres_docker

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/227.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #227


commit 81a634ad16f53f1de2afe2f768bcb059493c8313
Author: Nikhil Kak 
Date:   2017-11-09T22:54:01Z

Add docker file for postgres 9.6 and 10

- install plpython support for postgres
- add dockerfile for postgres 10 centos 7
- add postgres bin dir to $PATH for both 9.6 and 10
- remove unnecessary files




---


[GitHub] madlib pull request #223: Balance datasets : re-sampling technique

2018-01-16 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161864354
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -0,0 +1,994 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file EXCEPT in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import math
+import plpy
+import re
+from collections import defaultdict
+from fractions import Fraction
+from utilities.control import MinWarning
+from utilities.utilities import _assert
+from utilities.utilities import unique_string
+from utilities.validate_args import table_exists
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import table_is_empty
+from utilities.validate_args import get_cols
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`')
+
+def balance_sample(schema_madlib, source_table, output_table, class_col,
+class_sizes, output_table_size, grouping_cols, with_replacement, **kwargs):
+
+"""
+Balance sampling function
+Args:
+@param source_table   Input table name.
+@param output_table   Output table name.
+@param class_col  Name of the column containing the class to be
+  balanced.
+@param class_sizes Parameter to define the size of the different
+  class values.
+@param output_table_size  Desired size of the output data set.
+@param grouping_cols  The columns that define the grouping.
+@param with_replacement   Whether to sample with replacement.
+
+"""
+with MinWarning("warning"):
+
+class_counts = unique_string(desp='class_counts')
+desired_sample_per_class = unique_string(desp='desired_sample_per_class')
+desired_counts = unique_string(desp='desired_counts')
+
+if not class_sizes or class_sizes.strip().lower() in ('null', ''):
+class_sizes = 'uniform'
+
+_validate_strs(source_table, output_table, class_col, class_sizes,
+output_table_size, grouping_cols, with_replacement)
+
+source_table_columns = ','.join(get_cols(source_table))
+grp_by = "GROUP BY {0}".format(class_col)
+
+_create_frequency_distribution(class_counts, source_table, class_col)
+temp_views = [class_counts]
+
+if class_sizes.lower() == 'undersample' and not with_replacement:
+"""
+Random undersample without replacement.
+Randomly order the rows and give a unique (per class)
+identifier to each one.
+Select rows that have identifiers under the target limit.
+"""
+_undersampling_with_no_replacement(source_table, output_table, class_col,
+class_sizes, output_table_size, grouping_cols, with_replacement,
+class_counts, source_table_columns)
+
+_delete_temp_views(temp_views)
+return
+
+"""
+Create views for true and desired sample sizes of classes
+"""
+"""
+include_unsampled_classes tracks whether unsampled classes are desired or not.
+include_unsampled_classes is always True when output_table_size is NULL, but changes based on the desired sample class sizes given in the comma-delimited class_sizes parameter.
+"""
+include_unsampled_classes = True
+sampling_with_comma_delimited_class_sizes = class_sizes.find(':') > 0
+
+if sampling_with_comma_delimited_class_s
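
The docstring in the quoted diff describes random undersampling without replacement: order the rows randomly, give each a per-class identifier, and keep rows whose identifier is under the target limit. The sketch below shows that idea in plain Python rather than SQL. The function name is hypothetical, and choosing the smallest class size as the target limit is an assumption, since the quoted diff is truncated before the limit is defined.

```python
import random
from collections import defaultdict


def undersample_without_replacement(rows, class_of, seed=None):
    """Return a class-balanced subset of rows, drawn without replacement.

    class_of maps a row to its class value. The target size per class is
    assumed to be the size of the smallest class.
    """
    if not rows:
        return []
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[class_of(row)].append(row)
    target = min(len(members) for members in by_class.values())
    sample = []
    for members in by_class.values():
        # Random order within the class; a row's position in the shuffled
        # list plays the role of its per-class identifier.
        rng.shuffle(members)
        # Keep the rows whose identifier falls under the target limit.
        sample.extend(members[:target])
    return sample
```

Because each row is taken at most once from its shuffled class list, no row can appear twice in the result.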

[GitHub] madlib pull request #223: Balance datasets : re-sampling technique

2018-01-16 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161865238
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---

[GitHub] madlib pull request #223: Balance datasets : re-sampling technique

2018-01-16 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161863965
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---

[GitHub] madlib pull request #223: Balance datasets : re-sampling technique

2018-01-16 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161850906
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---

[GitHub] madlib pull request #223: Balance datasets : re-sampling technique

2018-01-16 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161297926
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---

[GitHub] madlib pull request #223: Balance datasets : re-sampling technique

2018-01-16 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161299042
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -0,0 +1,994 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file EXCEPT in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import math
+import plpy
+import re
+from collections import defaultdict
+from fractions import Fraction
+from utilities.control import MinWarning
+from utilities.utilities import _assert
+from utilities.utilities import unique_string
+from utilities.validate_args import table_exists
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import table_is_empty
+from utilities.validate_args import get_cols
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`')
+
+def balance_sample(schema_madlib, source_table, output_table, class_col,
+class_sizes, output_table_size, grouping_cols, with_replacement, 
**kwargs):
+
+"""
+Balance sampling function
+Args:
+@param source_table   Input table name.
+@param output_table   Output table name.
+@param class_col  Name of the column containing the class 
to be
+  balanced.
+@param class_size Parameter to define the size of the 
different
+  class values.
+@param output_table_size  Desired size of the output data set.
+@param grouping_cols  The columns columns that defines the 
grouping.
+@param with_replacement   The sampling method.
+
+"""
+with MinWarning("warning"):
+
+class_counts = unique_string(desp='class_counts')
+desired_sample_per_class = 
unique_string(desp='desired_sample_per_class')
+desired_counts = unique_string(desp='desired_counts')
+
+if not class_sizes or class_sizes.strip().lower() in ('null', ''):
+class_sizes = 'uniform'
+
+_validate_strs(source_table, output_table, class_col, class_sizes,
+output_table_size, grouping_cols, with_replacement)
+
+source_table_columns = ','.join(get_cols(source_table))
+grp_by = "GROUP BY {0}".format(class_col)
+
+_create_frequency_distribution(class_counts, source_table, 
class_col)
+temp_views = [class_counts]
+
+if class_sizes.lower() == 'undersample' and not with_replacement:
+"""
+Random undersample without replacement.
+Randomly order the rows and give a unique (per class)
+identifier to each one.
+Select rows that have identifiers under the target limit.
+"""
+_undersampling_with_no_replacement(source_table, output_table, 
class_col,
+class_sizes, output_table_size, grouping_cols, 
with_replacement,
+class_counts, source_table_columns)
+
+_delete_temp_views(temp_views)
+return
+
+"""
+Create views for true and desired sample sizes of classes
+"""
+"""
+include_unsampled_classes tracks is unsampled classes are desired or not.
+include_unsampled_classes is always true in output_table_size Null cases but changes given values of desired sample class sizes in comma-delimited classsize paramter.
+"""
+include_unsampled_classes = True
+sampling_with_comma_delimited_class_sizes = class_sizes.find(':') > 0
+
+if sampling_with_comma_delimited_class_s

[GitHub] madlib pull request #223: Balance datasets : re-sampling technique

2018-01-16 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161296957
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -0,0 +1,994 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file EXCEPT in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import math
+import plpy
+import re
+from collections import defaultdict
+from fractions import Fraction
+from utilities.control import MinWarning
+from utilities.utilities import _assert
+from utilities.utilities import unique_string
+from utilities.validate_args import table_exists
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import table_is_empty
+from utilities.validate_args import get_cols
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`')
+
+def balance_sample(schema_madlib, source_table, output_table, class_col,
+class_sizes, output_table_size, grouping_cols, with_replacement, 
**kwargs):
+
+"""
+Balance sampling function
+Args:
+@param source_table   Input table name.
+@param output_table   Output table name.
+@param class_col  Name of the column containing the class 
to be
+  balanced.
+@param class_size Parameter to define the size of the 
different
+  class values.
+@param output_table_size  Desired size of the output data set.
+@param grouping_cols  The columns columns that defines the 
grouping.
+@param with_replacement   The sampling method.
+
+"""
+with MinWarning("warning"):
+
+class_counts = unique_string(desp='class_counts')
+desired_sample_per_class = 
unique_string(desp='desired_sample_per_class')
+desired_counts = unique_string(desp='desired_counts')
+
+if not class_sizes or class_sizes.strip().lower() in ('null', ''):
+class_sizes = 'uniform'
+
+_validate_strs(source_table, output_table, class_col, class_sizes,
+output_table_size, grouping_cols, with_replacement)
+
+source_table_columns = ','.join(get_cols(source_table))
+grp_by = "GROUP BY {0}".format(class_col)
+
+_create_frequency_distribution(class_counts, source_table, 
class_col)
+temp_views = [class_counts]
+
+if class_sizes.lower() == 'undersample' and not with_replacement:
+"""
+Random undersample without replacement.
+Randomly order the rows and give a unique (per class)
+identifier to each one.
+Select rows that have identifiers under the target limit.
+"""
+_undersampling_with_no_replacement(source_table, output_table, 
class_col,
+class_sizes, output_table_size, grouping_cols, 
with_replacement,
+class_counts, source_table_columns)
+
+_delete_temp_views(temp_views)
+return
+
+"""
+Create views for true and desired sample sizes of classes
+"""
+"""
+include_unsampled_classes tracks is unsampled classes are desired or not.
+include_unsampled_classes is always true in output_table_size Null cases but changes given values of desired sample class sizes in comma-delimited classsize paramter.
--- End diff --

is -> if ?


---


[GitHub] madlib pull request #223: Balance datasets : re-sampling technique

2018-01-16 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161297074
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -0,0 +1,994 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file EXCEPT in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import math
+import plpy
+import re
+from collections import defaultdict
+from fractions import Fraction
+from utilities.control import MinWarning
+from utilities.utilities import _assert
+from utilities.utilities import unique_string
+from utilities.validate_args import table_exists
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import table_is_empty
+from utilities.validate_args import get_cols
+from utilities.utilities import py_list_to_sql_string
+
+
+m4_changequote(`')
+
+def balance_sample(schema_madlib, source_table, output_table, class_col,
+class_sizes, output_table_size, grouping_cols, with_replacement, 
**kwargs):
+
+"""
+Balance sampling function
+Args:
+@param source_table   Input table name.
+@param output_table   Output table name.
+@param class_col  Name of the column containing the class 
to be
+  balanced.
+@param class_size Parameter to define the size of the 
different
+  class values.
+@param output_table_size  Desired size of the output data set.
+@param grouping_cols  The columns columns that defines the 
grouping.
+@param with_replacement   The sampling method.
+
+"""
+with MinWarning("warning"):
+
+class_counts = unique_string(desp='class_counts')
+desired_sample_per_class = 
unique_string(desp='desired_sample_per_class')
+desired_counts = unique_string(desp='desired_counts')
+
+if not class_sizes or class_sizes.strip().lower() in ('null', ''):
+class_sizes = 'uniform'
+
+_validate_strs(source_table, output_table, class_col, class_sizes,
+output_table_size, grouping_cols, with_replacement)
+
+source_table_columns = ','.join(get_cols(source_table))
+grp_by = "GROUP BY {0}".format(class_col)
+
+_create_frequency_distribution(class_counts, source_table, 
class_col)
+temp_views = [class_counts]
+
+if class_sizes.lower() == 'undersample' and not with_replacement:
+"""
+Random undersample without replacement.
+Randomly order the rows and give a unique (per class)
+identifier to each one.
+Select rows that have identifiers under the target limit.
+"""
+_undersampling_with_no_replacement(source_table, output_table, 
class_col,
+class_sizes, output_table_size, grouping_cols, 
with_replacement,
+class_counts, source_table_columns)
+
+_delete_temp_views(temp_views)
+return
+
+"""
+Create views for true and desired sample sizes of classes
+"""
+"""
+include_unsampled_classes tracks is unsampled classes are desired or not.
+include_unsampled_classes is always true in output_table_size Null cases but changes given values of desired sample class sizes in comma-delimited classsize paramter.
+"""
+include_unsampled_classes = True
+sampling_with_comma_delimited_class_sizes = class_sizes.find(':') > 0
+
+if sampling_with_comma_delimited_class_sizes:
+"""
+Compute sample sizes based on
+comman-delimited list of class_sizes
--- End diff --

comman -> comma


---


[GitHub] madlib pull request #223: Balance datasets : re-sampling technique

2018-01-16 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161300298
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -0,0 +1,994 @@

[GitHub] madlib pull request #223: Balance datasets : re-sampling technique

2018-01-16 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/223#discussion_r161845440
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -0,0 +1,994 @@

[GitHub] madlib pull request #224: 1.13 Upgrade and MLP IC fix

2018-01-11 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/224

1.13 Upgrade and MLP IC fix

JIRA: MADLIB-1197

Additional Author: Nandish Jayaram 

- 1.13 Upgrade does not drop the kNN help functions even though their
return types are changed. This commit adds the missing functions to the
changelist and alters the upgrade_util.py_in so that functions without
arguments can be dropped.

- Some assert thresholds are too strict for MLP in IC. This commit relaxes
those thresholds.

Closes #224

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/orhankislal/madlib bugfix/mlp_and_upgrade

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/224.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #224


commit 0fdf136b4e0ad0fd2c54bab2144045b11ba5884b
Author: Orhan Kislal 
Date:   2018-01-12T01:22:30Z

1.13 Upgrade and MLP IC fix

JIRA: MADLIB-1197

Additional Author: Nandish Jayaram 

- 1.13 Upgrade does not drop the kNN help functions even though their
return types are changed. This commit adds the missing functions to the
changelist and alters the upgrade_util.py_in so that functions without
arguments can be dropped.

- Some assert thresholds are too strict for MLP in IC. This commit relaxes
those thresholds.

Closes #224




---


[GitHub] madlib issue #219: Multiple: Hard-wire values for construct_array calls

2017-12-26 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/219
  
LGTM +1


---


[GitHub] madlib pull request #220: Add more stats to summary function

2017-12-22 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/220#discussion_r158568809
  
--- Diff: src/ports/postgres/modules/summary/Summarizer.py_in ---
@@ -199,6 +200,22 @@ class Summarizer:
 args['max_columns'] = ','.join([minmax_type('max', c) for c in cols])
 
 args['ntile_columns'] = "array_to_string(array[NULL], ',')"
+
+args['positive_columns'] = ','.join(["sum(case when {0} > 0 \
+   then 1 else 0 end)".format(c['attname'])
+  if c['typname'] in numeric_types
+  else 'NULL' for c in cols])
+
+args["negative_columns"] = ','.join(["sum(case when {0} < 0 \
+   then 1 else 0 end)".format(c['attname'])
+  if c['typname'] in numeric_types
+  else 'NULL' for c in cols])
+
+args["zero_columns"] = ','.join(["sum(case when {0} = 0 \
--- End diff --

In graph algorithms such as SSSP and APSP, we used `EPSILON = 0.01` for 
float comparisons. 
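
A minimal sketch of what an epsilon-based comparison could look like for these counters. The helper name `sign_case_expressions` is hypothetical and not part of the Summarizer module; only the `EPSILON = 0.01` value comes from the comment above, and treating `abs(x) <= EPSILON` as zero is an assumption about how the tolerance would be applied.

```python
# Tolerance quoted in the comment above; how it is applied here is an assumption.
EPSILON = 0.01


def sign_case_expressions(column, epsilon=EPSILON):
    """Build SQL CASE expressions that count positive, negative and
    (approximately) zero values of a numeric column.

    Hypothetical helper for illustration only: values whose absolute
    value is within epsilon are counted as zero.
    """
    return {
        'positive': "sum(case when {0} > {1} then 1 else 0 end)".format(
            column, epsilon),
        'negative': "sum(case when {0} < -{1} then 1 else 0 end)".format(
            column, epsilon),
        'zero': "sum(case when abs({0}) <= {1} then 1 else 0 end)".format(
            column, epsilon),
    }


exprs = sign_case_expressions('x')
print(exprs['zero'])
```

With this split every non-NULL value falls into exactly one of the three buckets, so the counts still add up to the number of non-NULL rows.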


---


[GitHub] madlib issue #216: Release: Upgrade to v1.13

2017-12-15 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/216
  
Tested src and binary upgrades with success and fail scenarios. LGTM +1


---


[GitHub] madlib pull request #213: KNN: Move online help to python layer

2017-12-12 Thread orhankislal
Github user orhankislal closed the pull request at:

https://github.com/apache/madlib/pull/213


---


[GitHub] madlib pull request #213: KNN: Move online help to python layer

2017-12-11 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/213

KNN: Move online help to python layer

Additional Author: Nikhil Kak 

- Remove the dependency on the client message level for knn online help.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/orhankislal/madlib knn_help

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/213.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #213


commit a0b1e0a78ffc993f2e2efad8df9a2c49cfc0fcbb
Author: Orhan Kislal 
Date:   2017-12-11T23:27:09Z

KNN: Move online help to python layer

Additional Author: Nikhil Kak 

- Remove the dependency on the client message level for knn online help.




---


[GitHub] madlib issue #206: Feature: Allow NULL in rows for computing correlations an...

2017-12-05 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/206
  
Thank you @iyerr3 and @fmcquillan99 for your comments.


---


[GitHub] madlib pull request #194: Logregr: Add input validation for dep/indep variab...

2017-11-20 Thread orhankislal
Github user orhankislal closed the pull request at:

https://github.com/apache/madlib/pull/194


---


[GitHub] madlib issue #194: Logregr: Add input validation for dep/indep variables

2017-11-20 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/194
  
I have decided to close this pull request since the default error given by 
the database is more descriptive.


---


[GitHub] madlib issue #200: Madpack: Move unit tests + refactor minor code

2017-11-15 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/200
  
Tested reinstall as well as successful and unsuccessful upgrades on Postgres. 
LGTM +1


---


[GitHub] madlib pull request #194: Logregr: Add input validation for dep/indep variab...

2017-11-14 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/194#discussion_r150943944
  
--- Diff: src/ports/postgres/modules/regress/logistic.py_in ---
@@ -158,12 +159,14 @@ def __logregr_validate_args(schema_madlib, tbl_source, tbl_output, dep_col,
 if not dep_col or dep_col.strip().lower() in ('null', ''):
 plpy.error("Logregr error: Invalid dependent column name!")
 
-# if not columns_exist_in_table(tbl_source, [dep_col]):
-# plpy.error("Logregr error: Dependent column does not exist!")
+if not is_var_valid(tbl_source, dep_col):
+plpy.error("Logregr error: Dependent variable is not valid!")
--- End diff --

Since the variable can be an expression and not a column, I wanted to avoid 
printing the whole expression and making the error message long and confusing. 
We can easily add it if you feel that would be more useful.


---


[GitHub] madlib issue #197: Fix madlib version parsing for upgrade

2017-11-13 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/197
  
jenkins ok to test


---


[GitHub] madlib issue #199: Bugfix: Hard coded schema name in WCC install check

2017-11-13 Thread orhankislal
Github user orhankislal commented on the issue:

https://github.com/apache/madlib/pull/199
  
LGTM


---


[GitHub] madlib pull request #197: Fix madlib version parsing for upgrade

2017-11-13 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/197#discussion_r150687658
  
--- Diff: src/madpack/upgrade_util.py ---
@@ -142,11 +142,11 @@ def _load(self):
 """
 
 # _mad_dbrev = 1.9.1
-if self._mad_dbrev.split('.') < '1.10.0'.split('.'):
+if map(int,self._mad_dbrev.split('.')) < map(int,'1.10.0'.split('.')):
--- End diff --

I was thinking of the first option as you suggested. I am not sure about 
your second suggestion. I tried all 4 of the combinations and couldn't get it 
working.
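
For context, a small sketch of why the quoted change is needed. The helper name `parse_version` is hypothetical and is not the code in `upgrade_util.py`; it only illustrates the comparison. Note that the `map(int, ...)` form in the diff relies on Python 2, where `map` returns a list; in Python 3 `map` objects cannot be ordered, so the sketch builds tuples instead.

```python
def parse_version(version):
    """Turn a dotted version string such as '1.9.1' into a tuple of ints.

    Hypothetical helper for illustration; assumes purely numeric components.
    """
    return tuple(int(part) for part in version.split('.'))


# Comparing the split strings is lexicographic per component, and the
# string '9' sorts after '10', so 1.9.1 is wrongly treated as newer.
assert not ('1.9.1'.split('.') < '1.10.0'.split('.'))

# Comparing integer tuples orders the versions numerically.
assert parse_version('1.9.1') < parse_version('1.10.0')
```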


---


[GitHub] madlib pull request #197: Fix madlib version parsing for upgrade

2017-11-13 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/197#discussion_r150681944
  
--- Diff: src/madpack/upgrade_util.py ---
@@ -142,11 +142,11 @@ def _load(self):
 """
 
 # _mad_dbrev = 1.9.1
-if self._mad_dbrev.split('.') < '1.10.0'.split('.'):
+if map(int,self._mad_dbrev.split('.')) < map(int,'1.10.0'.split('.')):
--- End diff --

Importing those files from madpack.py creates a circular dependency.


---


[GitHub] madlib pull request #198: PMML: Update the pyxb version number to 1.2.6

2017-11-13 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/198

PMML: Update the pyxb version number to 1.2.6



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/orhankislal/madlib pyxb_version

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/198.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #198






---


[GitHub] madlib pull request #197: Fix madlib version parsing for upgrade

2017-11-13 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/madlib/pull/197

Fix madlib version parsing for upgrade



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/orhankislal/madlib upgrade

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/197.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #197


commit 7cd15c79dab8479c1a1cace18506b02f3f1ddf43
Author: Orhan Kislal 
Date:   2017-11-03T00:08:49Z

Fix madlib version parsing for upgrade




---


[GitHub] madlib pull request #195: Feature: Add grouping support to HITS

2017-11-13 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/195#discussion_r150629090
  
--- Diff: src/ports/postgres/modules/graph/graph_utils.py_in ---
@@ -109,6 +110,85 @@ def validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params,
 
 return None
 
+def validate_params_for_centrality_measures(schema_madlib, func_name,
--- End diff --

This function name is a bit confusing since it isn't used by the centrality 
measure functions in `measures.py_in`.


---

