[GitHub] madlib issue #345: Typos

2019-01-10 Thread kaknikhil
Github user kaknikhil commented on the issue:

https://github.com/apache/madlib/pull/345
  
@kinow Thanks for the PR. LGTM


---


[GitHub] madlib pull request #339: Build: Add PG11 Support

2018-11-28 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/339#discussion_r237326034
  
--- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
@@ -407,6 +407,15 @@ def set_client_min_messages(new_level):
 return old_msg_level
 # -
 
+def is_less_than_pg11(schema_madlib, **kwargs):
--- End diff --

I noticed that we have another function called `is_less_than_pg90` which 
has the same code as this function. We probably don't need the pg90 function 
anymore, but it would be nice if the `is_less_than_pg11` function could be 
generalized to take a major Postgres version and compare the current 
version against it.
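A generalized helper along those lines might look like this (a minimal sketch; the function name, the regex-based parsing, and the version-string format are assumptions, not MADlib's actual code):

```python
import re

def is_less_than_major_version(version_string, major):
    """Return True when the Postgres version embedded in version_string
    (e.g. "PostgreSQL 9.6.2 on x86_64-...") is older than `major`.

    Note: for pre-10 servers the real major version is two numbers
    (e.g. 9.6); comparing only the first number is sufficient for
    checks against majors >= 10 like the one discussed here.
    """
    match = re.search(r'(\d+)\.(\d+)', version_string)
    if match is None:
        raise ValueError("could not parse version from %r" % version_string)
    return int(match.group(1)) < major
```

With such a helper, `is_less_than_pg11` reduces to `is_less_than_major_version(version_str, 11)` and the separate pg90 variant could be dropped.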


---


[GitHub] madlib pull request #338: Install/Dev check: Add new test cases for some mod...

2018-11-15 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/338#discussion_r234052747
  
--- Diff: src/ports/postgres/modules/pmml/test/pmml.ic.sql_in ---
@@ -0,0 +1,119 @@
+/* --- 
*//**
+ *
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ *
+ *//* 
--- */
+DROP TABLE IF EXISTS abalone CASCADE;
+
+CREATE TABLE abalone (
+id integer,
+sex text,
+length double precision,
+diameter double precision,
+height double precision,
+whole double precision,
+shucked double precision,
+viscera double precision,
+shell double precision,
+rings integer
+);
+
+INSERT INTO abalone VALUES
+(3151, 'F', 0.655027, 0.505004, 
0.165008, 1.36699, 0.583519, 
0.351479, 0.396019, 10),
--- End diff --

Since this is an install check test, can we cut down on the dataset size 
and call the `glm` and `pmml` functions only once?


---


[GitHub] madlib pull request #334: Minibatch Preprocessor: Update online doc

2018-11-15 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/334#discussion_r234051398
  
--- Diff: 
src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
@@ -487,10 +487,16 @@ class MiniBatchDocumentation:
 
 SUMMARY
 
-MiniBatch Preprocessor is a utility function to pre process the 
input
-data for use with models that support mini-batching as an 
optimization
+The mini-batch preprocessor is a utility that prepares input data 
for
+use by models that support mini-batch as an optimization option. 
(This
+is currently only the case for Neural Networks.) It is effectively 
a
--- End diff --

/s/Neural Networks/Neural Network


---


[GitHub] madlib pull request #334: Minibatch Preprocessor: Update online doc

2018-11-15 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/334#discussion_r234051252
  
--- Diff: 
src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
@@ -487,10 +487,16 @@ class MiniBatchDocumentation:
 
 SUMMARY
 
-MiniBatch Preprocessor is a utility function to pre process the 
input
-data for use with models that support mini-batching as an 
optimization
+The mini-batch preprocessor is a utility that prepares input data 
for
+use by models that support mini-batch as an optimization option. 
(This
+is currently only the case for Neural Networks.) It is effectively 
a
+packing operation that builds arrays of dependent and independent
+variables from the source data table.
 
-#TODO add more here
+The advantage of using mini-batching is that it can perform better 
than
+stochastic gradient descent (default MADlib optimizer) because it 
uses
+more than one training example at a time, typically resulting 
faster
--- End diff --

missing the word "in": it should read `resulting in faster`.


---


[GitHub] madlib pull request #334: Minibatch Preprocessor: Update online doc

2018-11-15 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/334#discussion_r234051503
  
--- Diff: 
src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
@@ -487,10 +487,16 @@ class MiniBatchDocumentation:
 
 SUMMARY
 
-MiniBatch Preprocessor is a utility function to pre process the 
input
-data for use with models that support mini-batching as an 
optimization
+The mini-batch preprocessor is a utility that prepares input data 
for
+use by models that support mini-batch as an optimization option. 
(This
+is currently only the case for Neural Networks.) It is effectively 
a
+packing operation that builds arrays of dependent and independent
--- End diff --

should we instead say `build matrix of independent variable(s) and arrays 
of dependent variable`?
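Whatever wording is chosen, the packing operation itself can be pictured as grouping source rows into fixed-size buffers (a toy sketch, not the preprocessor's actual implementation):

```python
def pack_batches(independent, dependent, buffer_size):
    """Pack per-row features and labels into buffers of at most
    buffer_size rows: each buffer pairs a 2-D list (matrix) of
    independent variables with a list (array) of dependent variables."""
    buffers = []
    for start in range(0, len(independent), buffer_size):
        buffers.append((independent[start:start + buffer_size],
                        dependent[start:start + buffer_size]))
    return buffers
```

For example, five source rows packed with `buffer_size=2` yield three buffers, the last holding a single row.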


---


[GitHub] madlib pull request #334: Minibatch Preprocessor: Update online doc

2018-11-15 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/334#discussion_r234051175
  
--- Diff: 
src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
@@ -508,8 +514,13 @@ class MiniBatchDocumentation:
 dependent_varname, -- TEXT. Name of the dependent variable 
column
 independent_varname,   -- TEXT. Name of the independent 
variable
   column
-buffer_size-- INTEGER. Number of source input rows 
to
-  pack into batch
+grouping_col   -- TEXT. Default NULL. An expression 
list used
+  to group the input dataset into 
discrete groups
+buffer_size-- INTEGER. Default computed 
automatically.
+  Number of source input rows to pack 
into batch
--- End diff --

/s/batch/buffer


---


[GitHub] madlib pull request #332: Update Dockerfile to use ubuntu 16.04

2018-10-24 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/332#discussion_r227919519
  
--- Diff: tool/docker/base/Dockerfile_ubuntu16_postgres10 ---
@@ -0,0 +1,108 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+FROM ubuntu:16.04
+
+### Get necessary libraries to add postgresql apt repository
+RUN apt-get update && apt-get install -y lsb-core 
software-properties-common wget
+
+### Add postgresql apt repository
+RUN add-apt-repository "deb http://apt.postgresql.org/pub/repos/apt/ 
$(lsb_release -sc)-pgdg main" && wget --quiet -O - 
https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add - 
+
+### Have to update after getting new repository
+RUN apt-get update
+
+### Get postgres10 and postgres specific add-ons
+RUN apt-get install -y postgresql-10 \
+   postgresql-client-10 \
+   postgresql-plpython-10 \ 
+   postgresql-server-dev-10 \
+   libpq-dev \
+   build-essential \
+   openssl \
+   libssl-dev \
+   libboost-all-dev \
+   m4 \
+   vim \
+   pgxnclient \
+   flex \
+   bison \
+   graphviz
+
+### Reset pg_hba.conf file to allow no password prompt
+### Docker file doesn't support heredoc, like cat > /etc/postgresql/10/main/pg_hba.conf<<-EOF,
+### and this echo and \n\ are a workaround to write the file
+RUN echo " \n\
+# Database administrative login by Unix domain socket \n\
+local   all all trust 
\n\
+
+# TYPE  DATABASEUSERADDRESS METHOD 
\n\
+
+# "local" is for Unix domain socket connections only \n\
+local   all all trust 
\n\
+# IPv4 local connections: \n\
+hostall all 127.0.0.1/32trust 
\n\
+# IPv6 local connections: \n\
+hostall all ::1/128 trust 
\n\
+" > /etc/postgresql/10/main/pg_hba.conf
+
+### We need to set nproc to unlimited to be able to run scripts as
+### the user 'postgres'. This is actually useful when we try to setup
+### and start a Postgres server.
+RUN echo " * soft nproc unlimited " > 
/etc/security/limits.d/postgres-limits.conf
+
+
+### Always start postgres server when login
+RUN echo "service postgresql start" >> ~/.bashrc
+
+### Build custom CMake with SSQL support
+RUN wget https://cmake.org/files/v3.6/cmake-3.6.1.tar.gz && \
+  tar -zxvf cmake-3.6.1.tar.gz && \
+  cd cmake-3.6.1 && \
+  sed -i 's/-DCMAKE_BOOTSTRAP=1/-DCMAKE_BOOTSTRAP=1 
-DCMAKE_USE_OPENSSL=ON/g' bootstrap && \
+  ./configure &&  \
+  make -j2 && \
+  make install && \
+  cd ..
+
+### Install doxygen-1.8.13:
+RUN wget http://ftp.stack.nl/pub/users/dimitri/doxygen-1.8.13.src.tar.gz 
&& \
+  tar xf doxygen-1.8.13.src.tar.gz && \
+  cd doxygen-1.8.13 && \
+  mkdir build && \
+  cd build && \
+  cmake -G "Unix Makefiles" .. && \
+  make && \
+  make install
+
+### Optional: install LaTex
+### uncomment the following 'RUN apt-get' line to bake LaTex into the image
+### Note: if you run the following line, please tag the image as
+### madlib/postgres_10:LaTex, and don't tag it as latest
+# RUN apt-get install -y texlive-full
+
+## To build an image from this docker file without LaTex, from madlib 
folder, r


---

[GitHub] madlib pull request #332: Update Dockerfile to use ubuntu 16.04

2018-10-24 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/332#discussion_r227918905
  
--- Diff: tool/docker/base/Dockerfile_ubuntu16_postgres10 ---
@@ -0,0 +1,108 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+FROM ubuntu:16.04
+
+### Get necessary libraries to add postgresql apt repository
+RUN apt-get update && apt-get install -y lsb-core 
software-properties-common wget
+
+### Add postgresql apt repository
+RUN add-apt-repository "deb http://apt.postgresql.org/pub/repos/apt/ 
$(lsb_release -sc)-pgdg main" && wget --quiet -O - 
https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add - 
+
+### Have to update after getting new repository
+RUN apt-get update
+
+### Get postgres10 and postgres specific add-ons
+RUN apt-get install -y postgresql-10 \
+   postgresql-client-10 \
+   postgresql-plpython-10 \ 
+   postgresql-server-dev-10 \
+   libpq-dev \
+   build-essential \
+   openssl \
+   libssl-dev \
+   libboost-all-dev \
+   m4 \
+   vim \
+   pgxnclient \
+   flex \
+   bison \
+   graphviz
+
+### Reset pg_hba.conf file to allow no password prompt
+### Docker file doesn't support heredoc, like cat > /etc/postgresql/10/main/pg_hba.conf<<-EOF,
+### and this echo and \n\ are a workaround to write the file
+RUN echo " \n\
+# Database administrative login by Unix domain socket \n\
+local   all all trust 
\n\
+
+# TYPE  DATABASEUSERADDRESS METHOD 
\n\
+
+# "local" is for Unix domain socket connections only \n\
+local   all all trust 
\n\
+# IPv4 local connections: \n\
+hostall all 127.0.0.1/32trust 
\n\
+# IPv6 local connections: \n\
+hostall all ::1/128 trust 
\n\
+" > /etc/postgresql/10/main/pg_hba.conf
+
+### We need to set nproc to unlimited to be able to run scripts as
+### the user 'postgres'. This is actually useful when we try to setup
+### and start a Postgres server.
+RUN echo " * soft nproc unlimited " > 
/etc/security/limits.d/postgres-limits.conf
+
+
+### Always start postgres server when login
+RUN echo "service postgresql start" >> ~/.bashrc
+
+### Build custom CMake with SSQL support
+RUN wget https://cmake.org/files/v3.6/cmake-3.6.1.tar.gz && \
+  tar -zxvf cmake-3.6.1.tar.gz && \
+  cd cmake-3.6.1 && \
+  sed -i 's/-DCMAKE_BOOTSTRAP=1/-DCMAKE_BOOTSTRAP=1 
-DCMAKE_USE_OPENSSL=ON/g' bootstrap && \
+  ./configure &&  \
+  make -j2 && \
+  make install && \
+  cd ..
+
+### Install doxygen-1.8.13:
+RUN wget http://ftp.stack.nl/pub/users/dimitri/doxygen-1.8.13.src.tar.gz 
&& \
--- End diff --

we don't need to compile doxygen if we are not going to run `make doc`


---


[GitHub] madlib pull request #331: Build: Include preflight and postflight scripts fo...

2018-10-09 Thread kaknikhil
GitHub user kaknikhil opened a pull request:

https://github.com/apache/madlib/pull/331

Build: Include preflight and postflight scripts for mac

Commit #441f16bd55d2a26e4dd59df6129c6092f099cbca introduced a bug where
the preflight and postflight scripts for mac were not getting included
in the .dmg file. This commit adds an if check for the APPLE platform to
include these scripts necessary for dmg installation.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib bugfix/cmake-package

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/331.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #331


commit 08d85c0b09765ce4aea045611f62843db4c83e94
Author: Nikhil Kak 
Date:   2018-10-09T19:07:06Z

Build: Include preflight and postflight scripts for mac

Commit #441f16bd55d2a26e4dd59df6129c6092f099cbca introduced a bug where
the preflight and postflight scripts for mac were not getting included
in the .dmg file. This commit adds an if check for the APPLE platform to
include these scripts necessary for dmg installation.




---


[GitHub] madlib pull request #330: Margins: Copy summary table instead of renaming

2018-10-03 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/330#discussion_r222415826
  
--- Diff: src/ports/postgres/modules/regress/marginal.py_in ---
@@ -522,9 +520,8 @@ def margins_mlogregr(schema_madlib, source_table, 
out_table,
 """.format(**all_arguments))
 
 # Rename the output summary table
-rename_table(schema_madlib,
- "{old}_summary".format(old=mlog_out_table),
- "{new}_summary".format(new=out_table))
+plpy.execute("""CREATE TABLE {out_table}_summary AS
--- End diff --

we should ideally drop the `mlog_out_table` table even if it is in the 
pg_temp schema. Otherwise it will always be present in the current session. 
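The copy-then-drop pattern being suggested can be sketched as follows, using SQLite as a stand-in for the Postgres session (table names are illustrative, not the module's actual identifiers):

```python
import sqlite3

# In-memory database standing in for the session.
con = sqlite3.connect(":memory:")

# Intermediate summary table produced by the inner call.
con.execute("CREATE TABLE mlog_out_summary (coef REAL)")
con.execute("INSERT INTO mlog_out_summary VALUES (0.5)")

# Copy under the final output name instead of renaming...
con.execute("CREATE TABLE out_summary AS SELECT * FROM mlog_out_summary")

# ...then drop the intermediate table so it does not linger
# for the rest of the session.
con.execute("DROP TABLE mlog_out_summary")
```

In the real module the `DROP TABLE` would run via `plpy.execute` right after the `CREATE TABLE ... AS` statement.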


---


[GitHub] madlib pull request #330: Margins: Copy summary table instead of renaming

2018-10-03 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/330#discussion_r222418291
  
--- Diff: src/ports/postgres/modules/regress/marginal.py_in ---
@@ -522,9 +520,8 @@ def margins_mlogregr(schema_madlib, source_table, 
out_table,
 """.format(**all_arguments))
 
 # Rename the output summary table
-rename_table(schema_madlib,
- "{old}_summary".format(old=mlog_out_table),
- "{new}_summary".format(new=out_table))
+plpy.execute("""CREATE TABLE {out_table}_summary AS
--- End diff --

can we add a dev-check test to assert that the output summary table 
exists?


---


[GitHub] madlib issue #318: Madpack: Add a script for automating changelist creation

2018-09-14 Thread kaknikhil
Github user kaknikhil commented on the issue:

https://github.com/apache/madlib/pull/318
  
@orhankislal The new changes look good. Apart from the 2 minor comments, 
LGTM +1 


---


[GitHub] madlib pull request #318: Madpack: Add a script for automating changelist cr...

2018-09-14 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/318#discussion_r217842978
  
--- Diff: src/madpack/create_changelist.py ---
@@ -0,0 +1,239 @@
+#!/usr/bin/python
+# 
--
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# 
--
+
+# Create changelist for any two branches/tags
+
+# Prerequisites:
+# The old version has to be installed in the "madlib_old_vers" schema
+# The new version has to be installed in the "madlib" (default) schema
+# Two branches/tags must exist locally (run 'git fetch' to ensure you have 
the latest version)
+# The current branch does not matter
+
+# Usage (must be executed in the src/madpack directory):
+# python create_changelist.py
+# If you are using the master branch, please make sure to edit the branch/tag in the output file
+
+# Example (should be equivalent to changelist_1.13_1.14.yaml):
+# python create_changelist.py madlib rel/v1.13 rel/v1.14 chtest1.yaml
+
+import sys
+import os
+
+database = sys.argv[1]
+old_vers = sys.argv[2]
+new_vers = sys.argv[3]
+ch_filename = sys.argv[4]
+
+if os.path.exists(ch_filename):
+print "{0} already exists".format(ch_filename)
+raise SystemExit
+
+err1 = os.system("""psql {0} -l > /dev/null""".format(database))
+if err1 != 0:
+print "Database {0} does not exist".format(database)
+raise SystemExit
+
+err1 = os.system("""psql {0} -c "select madlib_old_vers.version()" > 
/dev/null
+ """.format(database))
+if err1 != 0:
+print "MADlib is not installed in the madlib_old_vers schema. Please 
refer to the Prequisites."
+raise SystemExit
+
+err1 = os.system("""psql {0} -c "select madlib.version()" > /dev/null
+ """.format(database))
+if err1 != 0:
+print "MADlib is not installed in the madlib schema. Please refer to 
the Prequisites."
+raise SystemExit
+
+print "Creating changelist {0}".format(ch_filename)
+os.system("rm -f /tmp/madlib_tmp_nm.txt /tmp/madlib_tmp_udf.txt 
/tmp/madlib_tmp_udt.txt")
+try:
+# Find the new modules using the git diff
+err1 = os.system("git diff {old_vers} {new_vers} --name-only 
--diff-filter=A > /tmp/madlib_tmp_nm.txt".format(**locals()))
+if err1 != 0:
+print "Git diff failed. Please ensure that branches/tags are 
fetched."
+raise SystemExit
+
+f = open("/tmp/madlib_tmp_cl.yaml", "w")
+f.write(
+"""# 
--
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# 
--
+""")
+
+f.write(
+""

[GitHub] madlib pull request #319: Allocator: Remove 16-byte alignment for pointers i...

2018-09-13 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/319#discussion_r217498702
  
--- Diff: src/ports/greenplum/dbconnector/dbconnector.hpp ---
@@ -32,6 +32,24 @@ extern "C" {
 
 #include "Compatibility.hpp"
 
+#if PG_VERSION_NUM >= 9
+// MADlib aligns the pointers returned by palloc() to 16-byte 
boundaries
+// (see Allocator_impl.hpp). This is done to allow Eigen vectorization 
 (see
+// http://eigen.tuxfamily.org/index.php?title=FAQ#Vectorization for 
more
+// info).  This vectorization has to be explicitly disabled if 
pointers are
+// not 16-byte aligned.
+// Further, the pointer realignment invalidates a header that palloc 
creates
+// just prior to the pointer address.  Greenplum after commit <> fails 
due
+// to this invalid header.  Hence, the pointer realignment (and Eigen
+// vectorization) is disabled below for Greenplum 6 and above.
+#define DISABLE_POINTER_ALIGNMENT_FOR_GREENPLUM
--- End diff --

Do we really need this macro? We can use the `EIGEN_DONT_VECTORIZE` macro 
instead. 


---


[GitHub] madlib issue #318: Madpack: Add a script for automating changelist creation

2018-09-12 Thread kaknikhil
Github user kaknikhil commented on the issue:

https://github.com/apache/madlib/pull/318
  
I tested the utility with the new changes but I am still reviewing the 
code. Here are my observations so far:

1. I noticed that there were a few indentation changes. I am assuming that 
this will not affect upgrade testing.
2. We will run this utility before tagging the new release. For example, if 
the previous release was 1.15 and the new release is 1.16, then the utility 
will be run when the version is 1.16-dev on the master branch. In this case, 
how should the utility be run? Should it be run against the master branch for 
the latest release? This is important because the branch name is used for the 
description message `Changelist for MADlib version old_branch/tag to 
new_branch/tag`



---


[GitHub] madlib pull request #318: Madpack: Add a script for automating changelist cr...

2018-09-12 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/318#discussion_r217152555
  
--- Diff: src/madpack/create_changelist.py ---
@@ -0,0 +1,229 @@
+#!/usr/bin/python
+# 
--
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# 
--
+
+# Create changelist for any two branches/tags
+
+# Prerequisites:
+# The old version has to be installed in the "madlib_old_vers" schema
+# The new version has to be installed in the "madlib" (default) schema
+# Two branches/tags must exist locally (run 'git fetch' to ensure you have 
the latest version)
+# The current branch does not matter
+
+# Usage (must be executed in the src/madpack directory):
+# python create_changelist.py
+
+# Example (should be equivalent to changelist_1.13_1.14.yaml):
+# python create_changelist.py madlib rel/v1.13 rel/v1.14 chtest1.yaml
+
+import sys
+import os
+
+database = sys.argv[1]
+old_vers = sys.argv[2]
+new_vers = sys.argv[3]
+ch_filename = sys.argv[4]
+
+if os.path.exists(ch_filename):
+print "{0} already exists".format(ch_filename)
+raise SystemExit
+
+err1 = os.system("""psql {0} -l > /dev/null""".format(database))
+if err1 != 0:
+print "Database {0} does not exist".format(old_vers)
+raise SystemExit
+
+err1 = os.system("""psql {0} -c "select madlib_old_vers.version()" > 
/dev/null
+ """.format(database))
+if err1 != 0:
+print "Schema {0} does not exist".format(old_vers)
+raise SystemExit
+
+err1 = os.system("""psql {0} -c "select madlib.version()" > /dev/null
+ """.format(database))
+if err1 != 0:
+print "Schema {0} does not exist".format(new_vers)
+raise SystemExit
+
--- End diff --

Can we add another error check to ensure that `madlib_old_vers.version()` 
matches the `old_vers` branch/tag? This does assume that the branch/tag 
will always be a release tag.
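One possible shape for such a check (the `rel/vX.Y` tag format and the helper name are assumptions, not part of the script):

```python
def tag_matches_installed_version(tag, installed_version):
    """Return True if a release tag such as 'rel/v1.13' refers to the
    same version string as the one reported by the installed schema,
    e.g. by `select madlib_old_vers.version()`."""
    # Strip an optional 'rel/' prefix and a leading 'v'.
    version = tag.rsplit('/', 1)[-1].lstrip('v')
    return version == installed_version
```

The script could call this after parsing its arguments and bail out with a clear message when the installed schema and the requested tag disagree.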


---


[GitHub] madlib pull request #318: Madpack: Add a script for automating changelist cr...

2018-09-12 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/318#discussion_r217143101
  
--- Diff: src/madpack/create_changelist.py ---
@@ -0,0 +1,229 @@
+#!/usr/bin/python
+# 
--
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# 
--
+
+# Create changelist for any two branches/tags
+
+# Prerequisites:
+# The old version has to be installed in the "madlib_old_vers" schema
+# The new version has to be installed in the "madlib" (default) schema
+# Two branches/tags must exist locally (run 'git fetch' to ensure you have 
the latest version)
+# The current branch does not matter
+
+# Usage (must be executed in the src/madpack directory):
+# python create_changelist.py
+
+# Example (should be equivalent to changelist_1.13_1.14.yaml):
+# python create_changelist.py madlib rel/v1.13 rel/v1.14 chtest1.yaml
+
+import sys
+import os
+
+database = sys.argv[1]
+old_vers = sys.argv[2]
+new_vers = sys.argv[3]
+ch_filename = sys.argv[4]
+
+if os.path.exists(ch_filename):
+print "{0} already exists".format(ch_filename)
+raise SystemExit
+
+err1 = os.system("""psql {0} -l > /dev/null""".format(database))
+if err1 != 0:
+print "Database {0} does not exist".format(old_vers)
+raise SystemExit
+
+err1 = os.system("""psql {0} -c "select madlib_old_vers.version()" > 
/dev/null
+ """.format(database))
+if err1 != 0:
+print "Schema {0} does not exist".format(old_vers)
--- End diff --

This error message prints the branch name instead of the schema name. Same 
with the previous and next error checks. Maybe we should change the var name 
to be more descriptive.


---


[GitHub] madlib pull request #318: Madpack: Add a script for automating changelist cr...

2018-09-10 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/318#discussion_r216443402
  
--- Diff: src/madpack/create_changelist.py ---
@@ -0,0 +1,132 @@
+#!/usr/bin/python
+# 
--
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+# 
--
+
+# Create changelist for any two branches
+
+# Prerequisites:
+# The old version has to be installed in the "madlib_old_vers" schema
+# The new version has to be installed in the "madlib" (default) schema
+# Two branches must exist locally (run 'git fetch' to ensure you have the 
latest version)
--- End diff --

Can we expand on this comment? Also mention that the old/new versions 
should be release tags, not branches.




---


[GitHub] madlib pull request #318: Madpack: Add a script for automating changelist cr...

2018-09-10 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/318#discussion_r216438294
  
--- Diff: src/madpack/diff_udf.sql ---
@@ -142,9 +142,12 @@ DROP TABLE IF EXISTS functions_madlib_new_version;
 SELECT get_functions('madlib_old_vers');
 
 SELECT
+type,
 --'\t-' || name || ':' || '\n\t\t-rettype: ' || retype || 
'\n\t\t-argument: ' || argtypes
-'- ' || name || ':' || '\nrettype: ' || retype || '\n  
  argument: ' || argtypes AS "Dropped UDFs"
-, type
+'- ' || name || ':' AS "Dropped UDF part1",
--- End diff --

I think it would be helpful to add comments in the commit description 
explaining the reasoning behind modifying `diff_udf.sql` and 
`diff_udt.sql`. Just looking at these changes, it's hard to know, for 
example, why `rettype` was removed and why we need `part1` at the end


---


[GitHub] madlib issue #318: Madpack: Add a script for automating changelist creation

2018-09-10 Thread kaknikhil
Github user kaknikhil commented on the issue:

https://github.com/apache/madlib/pull/318
  
I ran the utility with 1.13 and 1.14 as the release tags and noticed that 
the output file created by this utility isn't the same as the checked-in 
`changelist_1.13_1.14.yaml`. For example, all the summary functions are 
missing from the udf section. Also the udt section has an extra entry 
`mlp_step_result[]:`


---


[GitHub] madlib pull request #313: MLP: Simplify momentum and Nesterov updates

2018-08-21 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/313#discussion_r211725199
  
--- Diff: src/modules/convex/task/mlp.hpp ---
@@ -192,18 +189,23 @@ MLP::getLossAndUpdateModel(
 //  1. normalize to per row update
 //  2. discount by stepsize
 //  3. add regularization
-//  4. make negative
+//  4. make negative for descent
 for (Index k=0; k < model.num_layers; k++){
 Matrix regularization = MLP::lambda * model.u[k];
 regularization.row(0).setZero(); // Do not update bias
-total_gradient_per_layer[k] = -stepsize * 
(total_gradient_per_layer[k] / static_cast<double>(num_rows_in_batch) +
-  regularization);
-model.updateVelocity(total_gradient_per_layer[k], k);
-model.updatePosition(total_gradient_per_layer[k], k);
+total_gradient_per_layer[k] = -stepsize *
+(total_gradient_per_layer[k] / 
static_cast<double>(num_rows_in_batch) +
+ regularization);
+
+// total_gradient_per_layer is now the update vector
+model.velocity[k] = model.momentum * model.velocity[k] + 
total_gradient_per_layer[k];
--- End diff --

we should only update the velocity vector if momentum > 0
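To make the suggestion concrete, here is a minimal Python sketch of the classical momentum update with the proposed guard (illustrative variable names, not MADlib's actual C++ implementation):

```python
def momentum_update(weights, velocity, gradient_step, momentum):
    """Classical momentum: v = momentum * v + step; w = w + v.

    gradient_step is assumed to already be
    -stepsize * (gradient / batch_size + regularization).
    When momentum == 0 the velocity buffer contributes nothing,
    so we skip maintaining it and apply the plain step directly.
    """
    if momentum > 0:
        velocity = [momentum * v + g for v, g in zip(velocity, gradient_step)]
        update = velocity
    else:
        update = gradient_step  # plain SGD step, no velocity bookkeeping
    weights = [w + u for w, u in zip(weights, update)]
    return weights, velocity
```

With `momentum = 0` this degenerates to plain gradient descent and the velocity buffer is never touched, which is the behavior requested above.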


---


[GitHub] madlib pull request #313: MLP: Simplify momentum and Nesterov updates

2018-08-21 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/313#discussion_r211725456
  
--- Diff: src/modules/convex/task/mlp.hpp ---
@@ -192,18 +189,23 @@ MLP::getLossAndUpdateModel(
 //  1. normalize to per row update
 //  2. discount by stepsize
 //  3. add regularization
-//  4. make negative
+//  4. make negative for descent
 for (Index k=0; k < model.num_layers; k++){
 Matrix regularization = MLP::lambda * model.u[k];
 regularization.row(0).setZero(); // Do not update bias
-total_gradient_per_layer[k] = -stepsize * 
(total_gradient_per_layer[k] / static_cast<double>(num_rows_in_batch) +
-  regularization);
-model.updateVelocity(total_gradient_per_layer[k], k);
-model.updatePosition(total_gradient_per_layer[k], k);
+total_gradient_per_layer[k] = -stepsize *
+(total_gradient_per_layer[k] / 
static_cast<double>(num_rows_in_batch) +
+ regularization);
+
+// total_gradient_per_layer is now the update vector
+model.velocity[k] = model.momentum * model.velocity[k] + 
total_gradient_per_layer[k];
--- End diff --

can we also add a comment explaining why we are duplicating the code instead
of creating a function for these updates?


---


[GitHub] madlib pull request #310: Build: Download compatible Boost if version >= 1.6...

2018-08-13 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/310#discussion_r209718447
  
--- Diff: src/CMakeLists.txt ---
@@ -103,21 +103,29 @@ set(MAD_MODULE_DIR 
${CMAKE_CURRENT_SOURCE_DIR}/modules)
 # -- Third-party dependencies: Find or download Boost 
--
 
 find_package(Boost 1.47)
-
-# We use BOOST_ASSERT_MSG, which only exists in Boost 1.47 and later.
-# Unfortunately, the FindBoost module seems to be broken with respect to 
version
-# checking, so we will set Boost_FOUND to FALSE if the version is too old.
 if(Boost_FOUND)
+# We use BOOST_ASSERT_MSG, which only exists in Boost 1.47 and later.
+# Unfortunately, the FindBoost module seems to be broken with respect 
to
+# version checking, so we will set Boost_FOUND to FALSE if the version 
is
+# too old.
 if(Boost_VERSION LESS 104600)
+message(STATUS "No sufficiently recent version (>= 1.47) of Boost 
was found. Will download.")
+set(Boost_FOUND FALSE)
+endif(Boost_VERSION LESS 104600)
+
+# BOOST 1.65.0 removed the TR1 library which is required by MADlib till
+# C++11 is completely supported. Hence, we force download of a 
compatible
+# version if existing Boost is 1.65 or greater. FIXME: This should be
+# removed when TR1 dependency is removed.
+if(NOT Boost_VERSION LESS 106500)
--- End diff --

maybe add to the log message that we found a Boost version >= 1.65, which is
incompatible.
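For context, the `Boost_VERSION` variable set by CMake's FindBoost module is a single integer encoded as MAJOR*100000 + MINOR*100 + PATCH, so `106500` means 1.65.0. A hedged Python sketch of the gating logic in the diff above (the real logic is CMake; the helper name here is made up for illustration):

```python
def should_download_boost(found, boost_version):
    """Mirror the CMake checks. boost_version uses the old FindBoost
    integer scheme MAJOR*100000 + MINOR*100 + PATCH (1.65.0 -> 106500)."""
    if not found:
        return True, "Boost not found"
    if boost_version < 104600:
        # BOOST_ASSERT_MSG needs a sufficiently recent Boost
        return True, "no sufficiently recent (>= 1.47) Boost found"
    if boost_version >= 106500:
        # Boost 1.65 removed TR1, which MADlib still depends on
        return True, "found Boost >= 1.65, which is incompatible (TR1 removed)"
    return False, "existing Boost is compatible"
```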


---


[GitHub] madlib-site pull request #13: Add notebook for deep learning with keras

2018-08-10 Thread kaknikhil
GitHub user kaknikhil opened a pull request:

https://github.com/apache/madlib-site/pull/13

Add notebook for deep learning with keras

This iPython notebook walks the user through training and testing a deep
learning model built with tensorflow and keras in Greenplum 5.X. The
model is constructed within a plpython function in SQL, and this
function is invoked through a SQL UDF (with the training/test data
passed in).

Co-authored-by: Arvind Sridhar 
Co-authored-by: Rahul Iyer 
Co-authored-by: Orhan Kislal 
Co-authored-by: Jingyi Mei 
Co-authored-by: Domino Valdano 
Co-authored-by: Nandish Jayaram 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kaknikhil/madlib-site gpu-keras-notebook

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib-site/pull/13.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13


commit 765d4b31db64a06a9f0e7a9f1d548d6a1a4f
Author: Nikhil Kak 
Date:   2018-08-08T18:00:45Z

Add notebook for deep learning with keras

This iPython notebook walks the user through training and testing a deep
learning model built with tensorflow and keras in Greenplum 5.X. The
model is constructed within a plpython function in SQL, and this
function is invoked through a SQL UDF (with the training/test data
passed in).

Co-authored-by: Arvind Sridhar 
Co-authored-by: Rahul Iyer 
Co-authored-by: Orhan Kislal 
Co-authored-by: Jingyi Mei 
Co-authored-by: Domino Valdano 
Co-authored-by: Nandish Jayaram 




---


[GitHub] madlib pull request #282: Utilites: Add CTAS while dropping some columns

2018-06-28 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/282#discussion_r199006902
  
--- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
@@ -853,6 +857,34 @@ def validate_module_input_params(source_table, 
output_table, independent_varname
source_table=source_table))
 # 
 
+
+def create_table_drop_cols(source_table, out_table,
+   cols_to_drop, **kwargs):
+""" Create copy of table while dropping some of the columns
+Args:
+@param source_table str. Name of the source table
+@param out_table str. Name of the output table
+@param cols_to_drop str. Comma-separated list of columns to drop
+"""
+input_tbl_valid(source_table, 'Utilities')
+output_tbl_valid(out_table, 'Utilities')
+
--- End diff --

should we error out if the requested column to drop is not a valid column
in the table? If not, then we should add a note about this in the user docs
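A minimal sketch of the validation being suggested (hypothetical helper with simplified column matching; a quoted identifier that itself contains a comma would need real SQL identifier parsing):

```python
def validate_cols_to_drop(table_columns, cols_to_drop):
    """Raise if any requested column is missing from the source table.

    table_columns: list of column names in the source table
    cols_to_drop:  comma-separated string, e.g. 'humidity, windy'
    NOTE: naive split -- a quoted name like "Cont,features" would break it.
    """
    requested = [c.strip().strip('"')
                 for c in cols_to_drop.split(',') if c.strip()]
    known = {col.strip('"') for col in table_columns}
    missing = [c for c in requested if c not in known]
    if missing:
        raise ValueError("columns not found in source table: "
                         + ", ".join(missing))
    return requested
```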


---


[GitHub] madlib pull request #282: Utilites: Add CTAS while dropping some columns

2018-06-28 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/282#discussion_r199019606
  
--- Diff: src/ports/postgres/modules/utilities/test/utilities.sql_in ---
@@ -0,0 +1,77 @@
+/* --- 
*/
+/**
+ *
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ *
+ */
+/* --- 
*/
+
+CREATE TABLE "__madlib_temp_Quoted"(b varchar);
+CREATE TABLE __madlib_temp_non_quoted(a text);
+-- assert that madlib_temp tables are created
+SELECT assert(count(*) >= 2, 'Error setting up madlib_temp in schema ' || 
quote_ident(current_schema()))
+FROM pg_tables
+WHERE tablename LIKE '%madlib\_temp%'
+  AND quote_ident(schemaname) = quote_ident(current_schema());
+
+-- cleanup
+SELECT cleanup_madlib_temp_tables(quote_ident(current_schema()));
+
+-- assert that madlib_temp tables are dropped
+SELECT assert(count(*) = 0, 'Error cleaning up madlib_temp in schema ' || 
quote_ident(current_schema()))
+FROM pg_tables
+WHERE tablename LIKE '%madlib\_temp%'
+  AND quote_ident(schemaname) = quote_ident(current_schema());
+
+-- test dropcols
+DROP TABLE IF EXISTS dt_golf CASCADE;
+CREATE TABLE dt_golf (
+id integer NOT NULL,
+id_2 integer,
+"OUTLOOK" text,
+temperature double precision,
+humidity double precision,
+"Cont,features" double precision[],
+cat_features text[],
+windy boolean,
+class text
+) ;
+
+INSERT INTO dt_golf 
(id,"OUTLOOK",temperature,humidity,"Cont,features",cat_features, windy,class) 
VALUES
--- End diff --

should we add some more special characters, like in
https://github.com/apache/madlib/pull/281?


---


[GitHub] madlib pull request #282: Utilites: Add CTAS while dropping some columns

2018-06-28 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/282#discussion_r199002937
  
--- Diff: src/ports/postgres/modules/utilities/test/utilities.sql_in ---
@@ -0,0 +1,77 @@
+/* --- 
*/
+/**
+ *
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ *
+ */
+/* --- 
*/
+
+CREATE TABLE "__madlib_temp_Quoted"(b varchar);
+CREATE TABLE __madlib_temp_non_quoted(a text);
+-- assert that madlib_temp tables are created
+SELECT assert(count(*) >= 2, 'Error setting up madlib_temp in schema ' || 
quote_ident(current_schema()))
+FROM pg_tables
+WHERE tablename LIKE '%madlib\_temp%'
+  AND quote_ident(schemaname) = quote_ident(current_schema());
+
+-- cleanup
+SELECT cleanup_madlib_temp_tables(quote_ident(current_schema()));
+
+-- assert that madlib_temp tables are dropped
+SELECT assert(count(*) = 0, 'Error cleaning up madlib_temp in schema ' || 
quote_ident(current_schema()))
+FROM pg_tables
+WHERE tablename LIKE '%madlib\_temp%'
+  AND quote_ident(schemaname) = quote_ident(current_schema());
+
+-- test dropcols
+DROP TABLE IF EXISTS dt_golf CASCADE;
+CREATE TABLE dt_golf (
+id integer NOT NULL,
+id_2 integer,
+"OUTLOOK" text,
+temperature double precision,
+humidity double precision,
+"Cont,features" double precision[],
+cat_features text[],
+windy boolean,
+class text
+) ;
+
+INSERT INTO dt_golf 
(id,"OUTLOOK",temperature,humidity,"Cont,features",cat_features, windy,class) 
VALUES
+(1, 'sunny', 85, 85,ARRAY[85, 85], ARRAY['a', 'b'], false, 'Don''t Play'),
+(2, 'sunny', 80, 90, ARRAY[80, 90], ARRAY['a', 'b'], true, 'Don''t Play'),
+(3, 'overcast', 83, 78, ARRAY[83, 78], ARRAY['a', 'b'], false, 'Play'),
+(4, 'rain', 70, NULL, ARRAY[70, 96], ARRAY['a', 'b'], false, 'Play'),
+(5, 'rain', 68, 80, ARRAY[68, 80], ARRAY['a', 'b'], false, 'Play'),
+(6, 'rain', NULL, 70, ARRAY[65, 70], ARRAY['a', 'b'], true, 'Don''t Play'),
+(7, 'overcast', 64, 65, ARRAY[64, 65], ARRAY['c', 'b'], NULL , 'Play'),
+(8, 'sunny', 72, 95, ARRAY[72, 95], ARRAY['a', 'b'], false, 'Don''t Play'),
+(9, 'sunny', 69, 70, ARRAY[69, 70], ARRAY['a', 'b'], false, 'Play'),
+(10, 'rain', 75, 80, ARRAY[75, 80], ARRAY['a', 'b'], false, 'Play'),
+(11, 'sunny', 75, 70, ARRAY[75, 70], ARRAY['a', 'd'], true, 'Play'),
+(12, 'overcast', 72, 90, ARRAY[72, 90], ARRAY['c', 'b'], NULL, 'Play'),
+(13, 'overcast', 81, 75, ARRAY[81, 75], ARRAY['a', 'b'], false, 'Play'),
+(15, NULL, 81, 75, ARRAY[81, 75], ARRAY['a', 'b'], false, 'Play'),
+(16, 'overcast', NULL, 75, ARRAY[81, 75], ARRAY['a', 'd'], false, 'Play'),
+(14, 'rain', 71, 80, ARRAY[71, 80], ARRAY['c', 'b'], true, 'Don''t Play');
+
+SELECT dropcols('dt_golf', 'dt_golf2', '"OUTLOOK", "Cont,features", 
cat_features');
+-- test if the retained columns are present in output table
+SELECT
+id, id_2, temperature, humidity, windy, class
--- End diff --

maybe we should also test that the dropped columns were indeed dropped from 
the output table ?
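The extra assertion could be sketched like this, given column listings (e.g. fetched from `information_schema.columns`; the query itself is omitted here):

```python
def assert_columns_dropped(output_columns, dropped_columns):
    """Verify that none of the dropped columns survived in the output table."""
    leftover = set(dropped_columns) & set(output_columns)
    assert not leftover, "columns not dropped: %s" % sorted(leftover)

# e.g. for the dt_golf2 example above:
assert_columns_dropped(
    ['id', 'id_2', 'temperature', 'humidity', 'windy', 'class'],
    ['OUTLOOK', 'Cont,features', 'cat_features'])
```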


---


[GitHub] madlib pull request #282: Utilites: Add CTAS while dropping some columns

2018-06-28 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/282#discussion_r199002193
  
--- Diff: src/ports/postgres/modules/utilities/test/utilities.ic.sql_in ---
@@ -0,0 +1,58 @@
+/* --- 
*/
+/**
+ *
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ *
+ *//* 
--- */
+
+-- cleanup
+SELECT cleanup_madlib_temp_tables(quote_ident(current_schema()));
--- End diff --

I am assuming that we moved this test from the `drop_madlib_temp.ic.sql_in` 
file to this generic utilities install check test, right ?


---


[GitHub] madlib pull request #282: Utilites: Add CTAS while dropping some columns

2018-06-28 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/282#discussion_r199005872
  
--- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
@@ -853,6 +857,34 @@ def validate_module_input_params(source_table, 
output_table, independent_varname
source_table=source_table))
 # 
 
+
+def create_table_drop_cols(source_table, out_table,
--- End diff --

If client_min_messages is set to NOTICE, calling dropcols prints the
CREATE TABLE query. We should use the context manager to silence everything
except ERROR messages.


---


[GitHub] madlib pull request #283: Bugfix: Fix failing dev check in CRF

2018-06-27 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/283#discussion_r198641216
  
--- Diff: src/ports/postgres/modules/crf/test/crf_train_large.sql_in ---
@@ -234,26 +234,40 @@ INSERT INTO train_new_segmenttbl VALUES
 (30, 7, 'years', 13, 31),
 (31, 7, '.', 44, 31);
 
-CREATE TABLE train_new_regex(pattern text,name text); 
+CREATE TABLE train_new_regex(pattern text,name text);
 INSERT INTO train_new_regex VALUES
-('^[A-Z][a-z]+$','InitCapital'), ('^[A-Z]+$','isAllCapital'),
+('^[A-Z][a-z]+$','InitCapital'), ('^[A-Z]+$','isAllCapital'),
 ('^.*[0-9]+.*$','containsDigit'),('^.+[.]$','endsWithDot'),
 ('^.+[,]$','endsWithComma'), ('^.+er$','endsWithER'),
 ('^.+est$','endsWithEst'),   ('^.+ed$','endsWithED'),
 ('^.+s$','endsWithS'),   ('^.+ing$','endsWithIng'),
 ('^.+ly$','endsWithly'), 
('^.+-.+$','isDashSeparatedWords'),
 ('^.*@.*$','isEmailId');
-analyze train_new_regex;
+analyze train_new_regex;
 
-SELECT crf_train_fgen('train_new_segmenttbl', 'train_new_regex', 
'crf_label', 'train_new_dictionary', 
'train_new_featuretbl','train_new_featureset');
+CREATE TABLE crf_label_new (id integer,label character varying);
--- End diff --

The two files `crf_test_small.sql_in` and `crf_train_large.sql_in` have 
different indentation. Can we make them consistent?


---


[GitHub] madlib pull request #283: Bugfix: Fix failing dev check in CRF

2018-06-27 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/283#discussion_r198644242
  
--- Diff: src/ports/postgres/modules/crf/test/crf_test_small.sql_in ---
@@ -90,7 +90,7 @@
 (18,'PRP$'),(19,'RB'), (20,'RBR'),  (21,'RBS'), (22,'RP'), 
(23,'SYM'), (24,'TO'), (25,'UH'), (26,'VB'),
 (27,'VBD'), (28,'VBG'),(29,'VBN'),  (30,'VBP'), 
(31,'VBZ'),(32,'WDT'), (33,'WP'), (34,'WP$'),(35,'WRB'),
 (36,'$'),   (37,'#'),  (38,''), (39,'``'),  (40,'('),  
(41,')'),   (42,','),  (43,'.'),  (44,':');
-   analyze crf_label;
+   analyze test_crf_label;
--- End diff --

Assuming that the table `crf_label` doesn't exist, why wasn't crf install 
check always red? 


---


[GitHub] madlib pull request #281: Support special characters in MLP, minibatch prepr...

2018-06-21 Thread kaknikhil
GitHub user kaknikhil opened a pull request:

https://github.com/apache/madlib/pull/281

Support special characters in MLP, minibatch preprocessor and 
encode_categorical

Support special characters in MLP, minibatch preprocessor and 
encode_categorical

JIRA: MADLIB-1237
JIRA: MADLIB-1238
JIRA: MADLIB-1238
JIRA: MADLIB-1243

The module that needs to support special characters will have to call
quote_literal() on all the column values that need to be escaped and
quoted and then this list can be passed to the py_list_to_sql_string
function

We also created a function called get_distinct_col_levels which will
call quote_literal and then return a list of escaped column levels. The
output of this function can then be safely passed to
py_list_to_sql_string with long_format set as True.
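As a rough illustration of this escaping strategy, here is a simplified Python stand-in (the real `quote_literal` runs inside PostgreSQL, and this `py_list_to_sql_string` rendering is only a guess at the long format):

```python
def quote_literal(value):
    """Mimic PostgreSQL's quote_literal(): wrap in single quotes and
    double any embedded single quotes, so Don't -> 'Don''t'."""
    return "'" + str(value).replace("'", "''") + "'"

def py_list_to_sql_string(quoted_values):
    # hypothetical long-format rendering: an ARRAY of pre-quoted values
    return "ARRAY[" + ", ".join(quoted_values) + "]"

levels = [quote_literal(v) for v in ["Play", "Don't Play"]]
# safe to embed in SQL: ARRAY['Play', 'Don''t Play']
sql_fragment = py_list_to_sql_string(levels)
```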

Co-Authored-by: Jingyi Mei 
Co-Authored-by: Rahul Iyer 
Co-Authored-by: Arvind Sridhar 


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib bug_minibatch_preprocessor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/281.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #281


commit 4713b24eac1c27ba09cd1152e502b02bb1e13da4
Author: Jingyi Mei 
Date:   2018-05-23T23:29:54Z

MLP+Minibatch Preprocessing: Support special characters

JIRA: MADLIB-1237
JIRA: MADLIB-1238

This commit enables special character support for column names and
column values for mlp and minibatch preprocessor. We decided to use the
following strategy for supporting special characters

The module that needs to support special characters will have to call
quote_literal() on all the column values that need to be escaped and
quoted and then this list can be passed to the py_list_to_sql_string
function

We also created a function called get_distinct_col_levels which will
call quote_literal and then return a list of escaped column levels. The
output of this function can then be safely passed to
py_list_to_sql_string with long_format set as True.

Co-Authored-by: Jingyi Mei 
Co-Authored-by: Rahul Iyer 
Co-Authored-by: Arvind Sridhar 

commit d24cdfe1dbdcfe8ba2379a70a52cafeeba994c0e
Author: Arvind Sridhar 
Date:   2018-05-24T00:02:43Z

Encode categorical variables: handling special characters

JIRA: MADLIB-1238
JIRA: MADLIB-1243

This commit deals with special characters in column name and column
values. Also adds install check test cases to cover these scenarios.

Co-Authored-by: Jingyi Mei 
Co-Authored-by: Arvind Sridhar 

commit 262e796a9cb17c612c1e844ee7354be1abd11f5d
Author: Nikhil Kak 
Date:   2018-06-18T21:30:20Z

Cleanup: Remove unnecessary unit tests.

All the unit tests in utilties.py_in were moved to test_utilities.py_in.




---


[GitHub] madlib pull request #279: KNN: make install check asserts deterministic

2018-06-14 Thread kaknikhil
GitHub user kaknikhil opened a pull request:

https://github.com/apache/madlib/pull/279

KNN: make install check asserts deterministic

knn install check had a couple of array_agg asserts which were not
deterministic. This commit adds order by to the array agg function to
make it deterministic.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib bugfix/knn_ic

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/279.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #279


commit c117afa614bb84c4cc36b466934247e51d2eff4e
Author: Nikhil Kak 
Date:   2018-06-14T17:20:55Z

KNN: make install check asserts deterministic

knn install check had a couple of array_agg asserts which were not
deterministic. This commit adds order by to the array agg function to
make it deterministic.




---


[GitHub] madlib issue #278: Upgrade: add changelist file and bug fixes and changelist...

2018-06-13 Thread kaknikhil
Github user kaknikhil commented on the issue:

https://github.com/apache/madlib/pull/278
  
@iyerr3 yes we will have to rename it. 


---


[GitHub] madlib pull request #278: Upgrade: add changelist file and bug fixes and cha...

2018-06-11 Thread kaknikhil
GitHub user kaknikhil opened a pull request:

https://github.com/apache/madlib/pull/278

Upgrade: add changelist file and bug fixes and changelist file

This PR  adds a changelist file for testing 1.14 to 1.15 upgrade and 
addresses 2 bugs we found during upgrade testing.
1. Commit 8e34f68 added a new function called `_write_to_file` that takes 2 
arguments.
Some of the calls to this function were not passing the first file handle 
argument.
2. Append schema_madlib to the mlp_igd_final return type. This caused the
upgrade to fail from 1.12 to 1.x if there was a dependency on
mlp_igd_final.

Co-authored-by : Orhan Kislal 


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib 
feature/automate_upgrade_failure_sql

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/278.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #278


commit b011d8eb261f7b42b17a45fa9d881a4bebb69960
Author: Nikhil Kak 
Date:   2018-06-06T23:35:22Z

Upgrade: Fix 1.12 to 1.13 changelist

Append schema_madlib to the mlp_igd_final return type. This caused the
upgrade to fail from 1.12 to 1.x if there was a dependency on
mlp_igd_final.

Co-authored-by : Orhan Kislal 

commit 979036528167a9ff9d1352550c5d4650504efe72
Author: Nikhil Kak 
Date:   2018-06-07T17:54:30Z

Upgrade: Add empty changelist for 1.14 to 1.15-dev

This changelist will be filled in before the 1.15 release.  We will
need to rename this during the 1.15 release from 1.14_1.15-dev.yaml to
1.14_1.15.yaml.

commit 6d0cde5cfaa2fe6051d02ee2e5add3f77c344067
Author: Nikhil Kak 
Date:   2018-06-07T19:00:58Z

Upgrade: Fix missing filehandle arg from _write_to_file calls

Commit 8e34f68d7f7b902ebe511b934683724dc2d5dcca added a new function called 
`_write_to_file` that takes 2 arguments.
Some of the calls to this function were not passing the first file handle 
argument.




---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r191945714
  
--- Diff: src/modules/convex/task/mlp.hpp ---
@@ -197,6 +244,7 @@ MLP::loss(
 const model_type,
 const independent_variables_type,
 const dependent_variable_type   _true) {
+
--- End diff --

+1 will do


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r191942290
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -1781,7 +1799,7 @@ class MLPMinibatchPreProcessor:
 summary_table_columns = summary_table_columns[0]
 
 required_columns = (self.DEPENDENT_VARNAME, 
self.INDEPENDENT_VARNAME,
-self.CLASS_VALUES, self.GROUPING_COL)
+self.CLASS_VALUES, self.GROUPING_COL, 
self.DEPENDENT_VARTYPE)
 if set(required_columns) <= set(summary_table_columns):
 self.preprocessed_summary_dict = summary_table_columns
 else:
--- End diff --

+1 will do


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/272#discussion_r191942284
  
--- Diff: src/ports/postgres/modules/convex/mlp.sql_in ---
@@ -1474,13 +1480,15 @@ CREATE AGGREGATE MADLIB_SCHEMA.mlp_minibatch_step(
 /* warm_start_coeff */DOUBLE PRECISION[],
 /* lambda */  DOUBLE PRECISION,
 /* batch_size */  INTEGER,
-/* n_epochs */INTEGER
+/* n_epochs */INTEGER,
+/* momentum */DOUBLE PRECISION,
+/* is_nesterov */ BOOLEAN
 )(
 STYPE=DOUBLE PRECISION[],
 SFUNC=MADLIB_SCHEMA.mlp_minibatch_transition,
 m4_ifdef(`__POSTGRESQL__', `', 
`prefunc=MADLIB_SCHEMA.mlp_minibatch_merge,')
 FINALFUNC=MADLIB_SCHEMA.mlp_minibatch_final,
-INITCOND='{0,0,0,0,0,0,0,0,0,0,0,0}'
+INITCOND='{0,0,0,0,0,0,0,0,0,0,0,0,0,0}'
 );
 -
 
--- End diff --

+1 will do


---


[GitHub] madlib issue #271: Madpack: Make install, reinstall and upgrade atomic

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on the issue:

https://github.com/apache/madlib/pull/271
  
A few other observations from madpack output

1. While uninstalling, this message is printed on the console:
`madpack.py: INFO : Installing MADlib modules`. We should instead say that
we are uninstalling modules.

1. In case of failure, the output looks something like `MADlib upgrade 
unsuccessful.` It would be nice to also say that everything was rolled back.

1. I also noticed that the output messages have changed a bit for 
install/uninstall/reinstall and upgrade

old output 
```
madpack.py: INFO : Installing MADlib into MADLIB schema...
madpack.py: INFO : > Creating MADLIB schema
madpack.py: INFO : > Creating MADLIB.MigrationHistory table
madpack.py: INFO : > Writing version info in MigrationHistory table
madpack.py: INFO : > Creating objects for modules:
madpack.py: INFO : > - array_ops
madpack.py: INFO : > - bayes
...
...
madpack.py: INFO : > - validation
madpack.py: INFO : MADlib 1.13 installed successfully in MADLIB schema.
``` 

new output
```
madpack.py: INFO : Testing PL/Python environment...
madpack.py: INFO : > Creating language PL/Python...
madpack.py: INFO : > PL/Python environment OK (version: 2.7.14)
madpack.py: INFO : > Preparing objects for modules:
madpack.py: INFO : > - array_ops
madpack.py: INFO : > - bayes
...
...
madpack.py: INFO : > - validation
madpack.py: INFO : Installing MADlib modules...
madpack.py: INFO : > Created MADLIB schema
madpack.py: INFO : > Created MADLIB.MigrationHistory table
madpack.py: INFO : > Wrote version info in MigrationHistory table
madpack.py: INFO : MADlib 1.15-dev installed successfully in MADLIB schema.
```
I think the previous message was better because all the modules were 
printed after `Installing MADlib into MADLIB schema` instead of after 
`Preparing objects for modules`. And with the new output, `Installing MADlib
modules...` is followed by `Created MADLIB schema`, which looks a bit weird.


---


[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r191608985
  
--- Diff: src/madpack/madpack.py ---
@@ -824,6 +873,246 @@ def parse_arguments():
 # Get the arguments
 return parser.parse_args()
 
+def run_install_check(args, testcase):
+schema = args['schema']
+dbrev = args['dbrev']
+# 1) Compare OS and DB versions. Continue if OS = DB.
+if get_rev_num(dbrev) != get_rev_num(rev):
+_print_revs(rev, dbrev, con_args, schema)
+info_(this, "Versions do not match. Install-check stopped.", True)
+return
+
+# Create install-check user
+test_user = ('madlib_' +
+ rev.replace('.', '').replace('-', '_') +
+ '_installcheck')
+try:
+_internal_run_query("DROP USER IF EXISTS %s;" % (test_user), False)
+except:
+_internal_run_query("DROP OWNED BY %s CASCADE;" % (test_user), 
True)
+_internal_run_query("DROP USER IF EXISTS %s;" % (test_user), True)
+_internal_run_query("CREATE USER %s;" % (test_user), True)
+
+_internal_run_query("GRANT USAGE ON SCHEMA %s TO %s;" % (schema, 
test_user), True)
+
+# 2) Run test SQLs
+info_(this, "> Running test scripts for:", verbose)
+
+caseset = (set([test.strip() for test in testcase.split(',')])
+   if testcase != "" else set())
+
+modset = {}
+for case in caseset:
+if case.find('/') > -1:
+[mod, algo] = case.split('/')
+if mod not in modset:
+modset[mod] = []
+if algo not in modset[mod]:
+modset[mod].append(algo)
+else:
+modset[case] = []
+
+# Loop through all modules
+for moduleinfo in portspecs['modules']:
+
+# Get module name
+module = moduleinfo['name']
+
+# Skip if doesn't meet specified modules
+if modset is not None and len(modset) > 0 and module not in modset:
+continue
+# JIRA: MADLIB-1078 fix
+# Skip pmml during install-check (when run without the -t option).
+# We can still run install-check on pmml with '-t' option.
+if not modset and module in ['pmml']:
+continue
+info_(this, "> - %s" % module, verbose)
+
+# Make a temp dir for this module (if doesn't exist)
+cur_tmpdir = tmpdir + '/' + module + '/test'  # tmpdir is a global 
variable
+_make_dir(cur_tmpdir)
+
+# Find the Python module dir (platform specific or generic)
+if os.path.isdir(maddir + "/ports/" + portid + "/" + dbver + 
"/modules/" + module):
+maddir_mod_py = maddir + "/ports/" + portid + "/" + dbver + 
"/modules"
+else:
+maddir_mod_py = maddir + "/modules"
+
+# Find the SQL module dir (platform specific or generic)
+if os.path.isdir(maddir + "/ports/" + portid + "/modules/" + 
module):
+maddir_mod_sql = maddir + "/ports/" + portid + "/modules"
+else:
+maddir_mod_sql = maddir + "/modules"
+
+# Prepare test schema
+test_schema = "madlib_installcheck_%s" % (module)
+_internal_run_query("DROP SCHEMA IF EXISTS %s CASCADE; CREATE 
SCHEMA %s;" %
+(test_schema, test_schema), True)
+_internal_run_query("GRANT ALL ON SCHEMA %s TO %s;" %
+(test_schema, test_user), True)
+
+# Switch to test user and prepare the search_path
+pre_sql = '-- Switch to test user:\n' \
+  'SET ROLE %s;\n' \
+  '-- Set SEARCH_PATH for install-check:\n' \
+  'SET search_path=%s,%s;\n' \
+  % (test_user, test_schema, schema)
+
+# Loop through all test SQL files for this module
+sql_files = maddir_mod_sql + '/' + module + '/test/*.sql_in'
+for sqlfile in sorted(glob.glob(sql_files), reverse=True):
+algoname = os.path.basename(sqlfile).split('.')[0]
+# run only algo specified
+if (module in modset and modset[module] and
+algoname not in modset[module]):
+continue
+
+# Set file names
+tmpfile = cur_tmpdir + '/' + os.path.basename(sqlfile) + '.tmp'
+   

[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r191595833
  
--- Diff: src/madpack/madpack.py ---
@@ -559,71 +650,59 @@ def _db_rename_schema(from_schema, to_schema):
 # 
--
 
 
-def _db_create_schema(schema):
+def _db_create_schema(schema, is_schema_in_db, filehandle):
 """
 Create schema
 @param from_schema name of the schema to rename
+@param is_schema_in_db flag to indicate if schema is already 
present
 @param to_schema new name for the schema
 """
 
-info_(this, "> Creating %s schema" % schema.upper(), True)
--- End diff --

Does it add value to keep this info statement ?


---


[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r191601387
  
--- Diff: src/madpack/madpack.py ---
@@ -987,275 +1276,42 @@ def main(argv):
 error_(this, "Missing -p/--platform parameter.", True)
 if not con_args:
 error_(this, "Unknown problem with database connection string: %s" % con_args, True)
+#  Completed "Get and validate arguments" 
-
 
 # COMMAND: version
 if args.command[0] == 'version':
 _print_revs(rev, dbrev, con_args, schema)
 
-# COMMAND: uninstall/reinstall
-if args.command[0] in ('uninstall', 'reinstall'):
-if get_rev_num(dbrev) == [0]:
-info_(this, "Nothing to uninstall. No version found in schema %s." % schema.upper(), True)
-return
-
-# Find any potential data to lose
-affected_objects = _internal_run_query("""
-SELECT
-n1.nspname AS schema,
-relname AS relation,
-attname AS column,
-typname AS type
-FROM
-pg_attribute a,
-pg_class c,
-pg_type t,
-pg_namespace n,
-pg_namespace n1
-WHERE
-n.nspname = '%s'
-AND t.typnamespace = n.oid
-AND a.atttypid = t.oid
-AND c.oid = a.attrelid
-AND c.relnamespace = n1.oid
-AND c.relkind = 'r'
-ORDER BY
-n1.nspname, relname, attname, typname""" % schema.lower(), True)
-
-info_(this, "*** Uninstalling MADlib ***", True)
-info_(this, "***", True)
-info_(this, "* Schema %s and all database objects depending on it will be dropped!" % schema.upper(), True)
-if affected_objects:
-info_(this, "* If you continue the following data will be lost (schema : table.column : type):", True)
-for ao in affected_objects:
-info_(this, '* - ' + ao['schema'] + ' : ' + ao['relation'] + '.' +
-  ao['column'] + ' : ' + ao['type'], True)
-info_(this, "***", True)
-info_(this, "Would you like to continue? [Y/N]", True)
-go = raw_input('>>> ').upper()
-while go != 'Y' and go != 'N':
-go = raw_input('Yes or No >>> ').upper()
-
-# 2) Do the uninstall/drop
-if go == 'N':
-info_(this, 'No problem. Nothing dropped.', True)
-return
-
-elif go == 'Y':
-info_(this, "> dropping schema %s" % schema.upper(), verbose)
-try:
-_internal_run_query("DROP SCHEMA %s CASCADE;" % (schema), True)
-except:
-error_(this, "Cannot drop schema %s." % schema.upper(), True)
-
-info_(this, 'Schema %s (and all dependent objects) has been dropped.' % schema.upper(), True)
-info_(this, 'MADlib uninstalled successfully.', True)
-
-else:
-return
-
-# COMMAND: install/reinstall
-if args.command[0] in ('install', 'reinstall'):
-# Refresh MADlib version in DB, None for GP/PG
-if args.command[0] == 'reinstall':
-print "Setting MADlib database version to be None for reinstall"
-dbrev = None
-
-info_(this, "*** Installing MADlib ***", True)
-
-# 1) Compare OS and DB versions.
-# noop if OS <= DB.
-_print_revs(rev, dbrev, con_args, schema)
-if is_rev_gte(get_rev_num(dbrev), get_rev_num(rev)):
-info_(this, "Current MADlib version already up to date.", True)
-return
-# proceed to create objects if nothing installed in DB
-elif dbrev is None:
-pass
-# error and refer to upgrade if OS > DB
-else:
-error_(this, """Aborting installation: existing MADlib version detected in {0} schema
-To upgrade the {0} schema to MADlib v{1} please run the following command:
-madpack upgrade -s {0} -p {2} [-c ...]
-""".forma

[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r191604335
  
--- Diff: src/madpack/upgrade_util.py ---
@@ -1299,18 +1303,19 @@ def _clean_function(self):
 pattern = re.compile(r"""CREATE(\s+)FUNCTION""", re.DOTALL | re.IGNORECASE)
 self._sql = re.sub(pattern, 'CREATE OR REPLACE FUNCTION', self._sql)
 
-def cleanup(self, sql):
+def cleanup(self, sql, algoname):
 """
 @brief Entry function for cleaning the sql script
 """
 self._sql = sql
-self._clean_comment()
-self._clean_type()
-self._clean_cast()
-self._clean_operator()
-self._clean_opclass()
-self._clean_aggregate()
-self._clean_function()
+if algoname not in self.get_change_handler().newmodule:
--- End diff --

Can we switch the `if` logic to
```
if algoname in self.get_change_handler().newmodule:
    return self._sql
```

Also, it is not really clear why we need the `if` check in the first place. 
Can you explain it in a comment?
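For illustration, a minimal sketch of the early-return style suggested above. The class and attribute names here are simplified stand-ins for the real upgrade-util code, and only one cleanup pass is shown:

```python
import re

class SqlCleaner(object):
    """Simplified stand-in for the upgrade-util cleanup class."""
    def __init__(self, new_modules):
        self.new_modules = set(new_modules)

    def cleanup(self, sql, algoname):
        # Guard clause: a brand-new module was never installed before,
        # so its SQL needs no rewriting for upgrade; return it unchanged.
        if algoname in self.new_modules:
            return sql
        # ... the _clean_* passes would run here; one shown as an example ...
        return re.sub(r"CREATE(\s+)FUNCTION", "CREATE OR REPLACE FUNCTION",
                      sql, flags=re.IGNORECASE)
```

The guard clause makes the special case explicit up front instead of wrapping the whole body in an `if`.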


---


[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r191592112
  
--- Diff: src/madpack/madpack.py ---
@@ -238,6 +311,88 @@ def _run_sql_file(schema, maddir_mod_py, module, 
sqlfile,
 return retval
 # 
--
 
+def _run_sql_file(schema, sqlfile):
+"""
+Run SQL file
+@param schema name of the target schema
+@param sqlfile name of the file to parse
+"""
+# Run the SQL using DB command-line utility
+if portid in SUPPORTED_PORTS:
+sqlcmd = 'psql'
+# Test the DB cmd line utility
+std, err = subprocess.Popen(['which', sqlcmd], stdout=subprocess.PIPE,
+stderr=subprocess.PIPE).communicate()
+if not std:
+error_(this, "Command not found: %s" % sqlcmd, True)
+
+runcmd = [sqlcmd, '-a',
+  '-v', 'ON_ERROR_STOP=1',
+  '-h', con_args['host'].split(':')[0],
+  '-p', con_args['host'].split(':')[1],
+  '-d', con_args['database'],
+  '-U', con_args['user'],
+  '--no-password',
+  '--single-transaction',
+  '-f', sqlfile]
+runenv = os.environ
+if 'password' in con_args:
+runenv["PGPASSWORD"] = con_args['password']
+runenv["PGOPTIONS"] = '-c client_min_messages=notice'
+
+# Open log file
+logfile = sqlfile + '.log'
+try:
+log = open(logfile, 'w')
+except:
+error_(this, "Cannot create log file: %s" % logfile, False)
+raise Exception
+
+# Run the SQL
+try:
+info_(this, "> ... executing " + sqlfile, verbose)
+info_(this, ' '.join(runcmd), verbose)
+retval = subprocess.call(runcmd, env=runenv, stdout=log, stderr=log)
+except:
+error_(this, "Failed executing %s" % sqlfile, False)
+raise Exception
+finally:
+log.close()
+# Check the exit status
+result = _parse_result_logfile(retval, logfile, sqlfile)
+return result
+# 
--
+
+def _parse_result_logfile(retval, logfile, sql_abspath,
+  sql_filename=None, module=None, milliseconds=None):
+"""
+Function to parse the logfile and return if its content indicate a failure
+or success.
+"""
+is_install_check_logfile = bool(sql_filename and module)
+# Check the exit status
+if retval != 0:
+result = 'FAIL'
+global keeplogs
+keeplogs = True
+# Since every single statement in the test file gets logged,
+# an empty log file indicates an empty or a failed test
+elif os.path.isfile(logfile) and os.path.getsize(logfile) > 0:
+result = 'PASS'
+# Otherwise
+else:
+result = 'ERROR'
+
+if is_install_check_logfile:
--- End diff --

Can we move the `install_check` logic out of this function? The function 
already returns `result`; the caller can use that variable to print the 
install-check output.
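One way to sketch that separation (function names and the report format here are hypothetical): the parser only classifies the outcome, and install-check reporting becomes the caller's job.

```python
import os

def parse_result_logfile(retval, logfile):
    """Classify a run from its exit status and log size; no printing here."""
    if retval != 0:
        return 'FAIL'
    # Every statement in a test file gets logged, so an empty log
    # indicates an empty or failed test.
    if os.path.isfile(logfile) and os.path.getsize(logfile) > 0:
        return 'PASS'
    return 'ERROR'

def report_install_check(result, module, sql_filename):
    """Caller-side install-check reporting, kept out of the parser."""
    print("TEST CASE RESULT|Module: %s|%s|%s" % (module, sql_filename, result))
```

The parser stays a pure function of its inputs, which also makes it straightforward to unit test.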


---


[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r191587755
  
--- Diff: src/madpack/madpack.py ---
@@ -131,10 +141,73 @@ def _get_relative_maddir(maddir, port):
 return maddir
 # 
--
 
+def _cleanup_comments_in_sqlfile(output_filename, upgrade):
+"""
+@brief Remove comments in the sql script
+"""
+if not upgrade:
+with open(output_filename, 'r+') as output_filehandle:
+full_sql = output_filehandle.read()
+pattern = re.compile(r"""(/\*(.|[\r\n])*?\*/)|(--(.*|[\r\n]))""")
+res = ''
+lines = re.split(r'[\r\n]+', full_sql)
+for line in lines:
+tmp = line
+if not tmp.strip().startswith("E'"):
+line = re.sub(pattern, '', line)
+res += line + '\n'
+full_sql = res.strip()
+full_sql = re.sub(pattern, '', full_sql).strip()
+# Re-write the cleaned-up sql to a new file. Python does not let us
--- End diff --

Can we move the new-file creation and renaming logic to a different 
function? That way this function will have the single responsibility of 
cleaning the input.
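A sketch of that split, assuming two single-purpose helpers (the names are hypothetical): a pure function that strips comments, and a separate one that handles the write-then-rename step.

```python
import os
import re

# Same comment pattern as in the diff above.
COMMENT_PATTERN = re.compile(r"(/\*(.|[\r\n])*?\*/)|(--(.*|[\r\n]))")

def strip_sql_comments(full_sql):
    """Pure function: remove /* ... */ and -- comments, keep E'...' lines."""
    cleaned = []
    for line in re.split(r'[\r\n]+', full_sql):
        if not line.strip().startswith("E'"):
            line = re.sub(COMMENT_PATTERN, '', line)
        cleaned.append(line)
    return '\n'.join(cleaned).strip()

def rewrite_file(filename, contents):
    """Write contents to a temp file, then rename it over the original."""
    tmp_name = filename + '.tmp'
    with open(tmp_name, 'w') as fh:
        fh.write(contents)
    os.rename(tmp_name, filename)
```

The caller then composes the two: `rewrite_file(path, strip_sql_comments(open(path).read()))`.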




---


[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r191580503
  
--- Diff: src/madpack/utilities.py ---
@@ -33,6 +33,23 @@
 this = os.path.basename(sys.argv[0])# name of this script
 
 
+class AtomicFileOpen:
--- End diff --

Do we need this `AtomicFileOpen` class? I don't think it's used anywhere.


---


[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r191609335
  
--- Diff: src/madpack/madpack.py ---
@@ -537,9 +629,8 @@ def _db_upgrade(schema, dbrev):
 ch.drop_changed_udf()
 ch.drop_changed_udt()  # assume dependent udf for udt does not change
 ch.drop_traininginfo_4dt()  # used types: oid, text, integer, float
-_db_create_objects(schema, None, True, sc)
-
-info_(this, "MADlib %s upgraded successfully in %s schema." % (str(rev), schema.upper()), True)
+_db_create_objects(schema, filehandle, True, sc)
--- End diff --

Can we rewrite this as
```python
_db_create_objects(schema, filehandle, upgrade=True, sc=sc)
```


---


[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r191587596
  
--- Diff: src/madpack/madpack.py ---
@@ -131,10 +141,73 @@ def _get_relative_maddir(maddir, port):
 return maddir
 # 
--
 
+def _cleanup_comments_in_sqlfile(output_filename, upgrade):
+"""
+@brief Remove comments in the sql script
+"""
+if not upgrade:
+with open(output_filename, 'r+') as output_filehandle:
+full_sql = output_filehandle.read()
+pattern = re.compile(r"""(/\*(.|[\r\n])*?\*/)|(--(.*|[\r\n]))""")
+res = ''
+lines = re.split(r'[\r\n]+', full_sql)
+for line in lines:
+tmp = line
+if not tmp.strip().startswith("E'"):
+line = re.sub(pattern, '', line)
+res += line + '\n'
+full_sql = res.strip()
+full_sql = re.sub(pattern, '', full_sql).strip()
+# Re-write the cleaned-up sql to a new file. Python does not let us
--- End diff --

can we move the new file creation and the renaming logic to a different 
function? This way the function will have a single responsibility of just 
cleaning the input.




---


[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r191589562
  
--- Diff: src/madpack/madpack.py ---
@@ -238,6 +311,88 @@ def _run_sql_file(schema, maddir_mod_py, module, 
sqlfile,
 return retval
 # 
--
 
+def _run_sql_file(schema, sqlfile):
--- End diff --

The three functions `_run_sql_file`, `_run_sql_file_install_check`, and 
`_run_m4_and_append` duplicate some logic related to running m4 and running 
the SQL file. Can we refactor this to avoid the duplication?
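One way to start such a refactor is to pull the shared psql invocation into a single builder that all three functions call. This is only a sketch; the `con_args` keys mirror the diff, and the function name is hypothetical:

```python
def build_psql_command(con_args, sqlfile):
    """Build the psql argument list shared by the _run_sql_file* variants."""
    host, port = con_args['host'].split(':')
    return ['psql', '-a',
            '-v', 'ON_ERROR_STOP=1',
            '-h', host,
            '-p', port,
            '-d', con_args['database'],
            '-U', con_args['user'],
            '--no-password',
            '--single-transaction',
            '-f', sqlfile]
```

Each caller would then only add its own pre/post handling (logging, m4 preprocessing) around the shared command.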


---


[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r191603261
  
--- Diff: src/madpack/madpack.py ---
@@ -131,10 +141,73 @@ def _get_relative_maddir(maddir, port):
 return maddir
 # 
--
 
+def _cleanup_comments_in_sqlfile(output_filename, upgrade):
+"""
+@brief Remove comments in the sql script
+"""
+if not upgrade:
+with open(output_filename, 'r+') as output_filehandle:
+full_sql = output_filehandle.read()
+pattern = re.compile(r"""(/\*(.|[\r\n])*?\*/)|(--(.*|[\r\n]))""")
+res = ''
+lines = re.split(r'[\r\n]+', full_sql)
+for line in lines:
+tmp = line
+if not tmp.strip().startswith("E'"):
+line = re.sub(pattern, '', line)
+res += line + '\n'
+full_sql = res.strip()
+full_sql = re.sub(pattern, '', full_sql).strip()
+# Re-write the cleaned-up sql to a new file. Python does not let us
+# erase all the content of a file and rewrite the same file again.
+cleaned_output_filename = output_filename+'.tmp'
+with open(cleaned_output_filename, 'w') as output_filehandle:
+_write_to_file(output_filehandle, full_sql)
+# Move the cleaned output file to the old one.
+os.rename(cleaned_output_filename, output_filename)
+
+def _run_m4_and_append(schema, maddir_mod_py, module, sqlfile,
+   output_filehandle, pre_sql=None):
+"""
+Function to process a sql file with M4.
+"""
+# Check if the SQL file exists
+if not os.path.isfile(sqlfile):
+error_(this, "Missing module SQL file (%s)" % sqlfile, False)
+raise ValueError("Missing module SQL file (%s)" % sqlfile)
 
-def _run_sql_file(schema, maddir_mod_py, module, sqlfile,
-  tmpfile, logfile, pre_sql, upgrade=False,
-  sc=None):
+# Prepare the file using M4
+try:
+# Add the before SQL
+if pre_sql:
+output_filehandle.writelines([pre_sql, '\n\n'])
+# Find the madpack dir (platform specific or generic)
+if os.path.isdir(maddir + "/ports/" + portid + "/" + dbver + "/madpack"):
+maddir_madpack = maddir + "/ports/" + portid + "/" + dbver + "/madpack"
+else:
+maddir_madpack = maddir + "/madpack"
+maddir_ext_py = maddir + "/lib/python"
+
+m4args = ['m4',
+  '-P',
+  '-DMADLIB_SCHEMA=' + schema,
+  '-DPLPYTHON_LIBDIR=' + maddir_mod_py,
+  '-DEXT_PYTHON_LIBDIR=' + maddir_ext_py,
+  '-DMODULE_PATHNAME=' + maddir_lib,
+  '-DMODULE_NAME=' + module,
+  '-I' + maddir_madpack,
+  sqlfile]
+
+info_(this, "> ... parsing: " + " ".join(m4args), verbose)
+output_filehandle.flush()
+subprocess.call(m4args, stdout=output_filehandle)
+except:
+error_(this, "Failed executing m4 on %s" % sqlfile, False)
+raise Exception
+
+def _run_sql_file_install_check(schema, maddir_mod_py, module, sqlfile,
+tmpfile, logfile, pre_sql, upgrade=False,
+sc=None):
--- End diff --

We don't really need the two optional params, since 
`_run_sql_file_install_check` is only called once.


---


[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r191608639
  
--- Diff: src/madpack/madpack.py ---
@@ -824,6 +873,246 @@ def parse_arguments():
 # Get the arguments
 return parser.parse_args()
 
+def run_install_check(args, testcase):
+schema = args['schema']
+dbrev = args['dbrev']
+# 1) Compare OS and DB versions. Continue if OS = DB.
+if get_rev_num(dbrev) != get_rev_num(rev):
+_print_revs(rev, dbrev, con_args, schema)
+info_(this, "Versions do not match. Install-check stopped.", True)
+return
+
+# Create install-check user
+test_user = ('madlib_' +
+ rev.replace('.', '').replace('-', '_') +
+ '_installcheck')
+try:
+_internal_run_query("DROP USER IF EXISTS %s;" % (test_user), False)
+except:
+_internal_run_query("DROP OWNED BY %s CASCADE;" % (test_user), True)
+_internal_run_query("DROP USER IF EXISTS %s;" % (test_user), True)
+_internal_run_query("CREATE USER %s;" % (test_user), True)
+
+_internal_run_query("GRANT USAGE ON SCHEMA %s TO %s;" % (schema, test_user), True)
+
+# 2) Run test SQLs
+info_(this, "> Running test scripts for:", verbose)
+
+caseset = (set([test.strip() for test in testcase.split(',')])
+   if testcase != "" else set())
+
+modset = {}
+for case in caseset:
+if case.find('/') > -1:
+[mod, algo] = case.split('/')
+if mod not in modset:
+modset[mod] = []
+if algo not in modset[mod]:
+modset[mod].append(algo)
+else:
+modset[case] = []
+
+# Loop through all modules
+for moduleinfo in portspecs['modules']:
+
+# Get module name
+module = moduleinfo['name']
+
+# Skip if doesn't meet specified modules
+if modset is not None and len(modset) > 0 and module not in modset:
+continue
+# JIRA: MADLIB-1078 fix
+# Skip pmml during install-check (when run without the -t option).
+# We can still run install-check on pmml with '-t' option.
+if not modset and module in ['pmml']:
+continue
+info_(this, "> - %s" % module, verbose)
+
+# Make a temp dir for this module (if doesn't exist)
+cur_tmpdir = tmpdir + '/' + module + '/test'  # tmpdir is a global variable
+_make_dir(cur_tmpdir)
+
+# Find the Python module dir (platform specific or generic)
+if os.path.isdir(maddir + "/ports/" + portid + "/" + dbver + "/modules/" + module):
+maddir_mod_py = maddir + "/ports/" + portid + "/" + dbver + "/modules"
+else:
+maddir_mod_py = maddir + "/modules"
+
+# Find the SQL module dir (platform specific or generic)
+if os.path.isdir(maddir + "/ports/" + portid + "/modules/" + module):
+maddir_mod_sql = maddir + "/ports/" + portid + "/modules"
+else:
+maddir_mod_sql = maddir + "/modules"
+
+# Prepare test schema
+test_schema = "madlib_installcheck_%s" % (module)
+_internal_run_query("DROP SCHEMA IF EXISTS %s CASCADE; CREATE SCHEMA %s;" %
+(test_schema, test_schema), True)
+_internal_run_query("GRANT ALL ON SCHEMA %s TO %s;" %
+(test_schema, test_user), True)
+
+# Switch to test user and prepare the search_path
+pre_sql = '-- Switch to test user:\n' \
+  'SET ROLE %s;\n' \
+  '-- Set SEARCH_PATH for install-check:\n' \
+  'SET search_path=%s,%s;\n' \
+  % (test_user, test_schema, schema)
+
+# Loop through all test SQL files for this module
+sql_files = maddir_mod_sql + '/' + module + '/test/*.sql_in'
+for sqlfile in sorted(glob.glob(sql_files), reverse=True):
+algoname = os.path.basename(sqlfile).split('.')[0]
+# run only algo specified
+if (module in modset and modset[module] and
+algoname not in modset[module]):
+continue
+
+# Set file names
+tmpfile = cur_tmpdir + '/' + os.path.basename(sqlfile) + '.tmp'
+   

[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r191588071
  
--- Diff: src/madpack/madpack.py ---
@@ -131,10 +141,73 @@ def _get_relative_maddir(maddir, port):
 return maddir
 # 
--
 
+def _cleanup_comments_in_sqlfile(output_filename, upgrade):
+"""
+@brief Remove comments in the sql script
+"""
+if not upgrade:
+with open(output_filename, 'r+') as output_filehandle:
+full_sql = output_filehandle.read()
+pattern = re.compile(r"""(/\*(.|[\r\n])*?\*/)|(--(.*|[\r\n]))""")
+res = ''
+lines = re.split(r'[\r\n]+', full_sql)
+for line in lines:
+tmp = line
+if not tmp.strip().startswith("E'"):
+line = re.sub(pattern, '', line)
+res += line + '\n'
+full_sql = res.strip()
+full_sql = re.sub(pattern, '', full_sql).strip()
+# Re-write the cleaned-up sql to a new file. Python does not let us
+# erase all the content of a file and rewrite the same file again.
+cleaned_output_filename = output_filename+'.tmp'
+with open(cleaned_output_filename, 'w') as output_filehandle:
+_write_to_file(output_filehandle, full_sql)
+# Move the cleaned output file to the old one.
+os.rename(cleaned_output_filename, output_filename)
+
+def _run_m4_and_append(schema, maddir_mod_py, module, sqlfile,
+   output_filehandle, pre_sql=None):
+"""
+Function to process a sql file with M4.
+"""
+# Check if the SQL file exists
+if not os.path.isfile(sqlfile):
+error_(this, "Missing module SQL file (%s)" % sqlfile, False)
--- End diff --

Why do we need to call `error_`? Isn't raising `ValueError` enough?
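A sketch of the suggested simplification: raise the exception with the descriptive message directly, instead of logging via `error_` and then raising a bare `Exception`. The helper name below is hypothetical:

```python
import os

def check_module_sql_file(sqlfile):
    """Raise a descriptive ValueError if the module SQL file is missing."""
    if not os.path.isfile(sqlfile):
        raise ValueError("Missing module SQL file (%s)" % sqlfile)
```

The caller can still catch `ValueError` and route the message through `error_` if it wants the logging side effect.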


---


[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r191607552
  
--- Diff: src/madpack/madpack.py ---
@@ -824,6 +873,246 @@ def parse_arguments():
 # Get the arguments
 return parser.parse_args()
 
+def run_install_check(args, testcase):
+schema = args['schema']
+dbrev = args['dbrev']
+# 1) Compare OS and DB versions. Continue if OS = DB.
+if get_rev_num(dbrev) != get_rev_num(rev):
+_print_revs(rev, dbrev, con_args, schema)
+info_(this, "Versions do not match. Install-check stopped.", True)
+return
+
+# Create install-check user
+test_user = ('madlib_' +
+ rev.replace('.', '').replace('-', '_') +
+ '_installcheck')
+try:
+_internal_run_query("DROP USER IF EXISTS %s;" % (test_user), False)
+except:
+_internal_run_query("DROP OWNED BY %s CASCADE;" % (test_user), True)
+_internal_run_query("DROP USER IF EXISTS %s;" % (test_user), True)
+_internal_run_query("CREATE USER %s;" % (test_user), True)
+
+_internal_run_query("GRANT USAGE ON SCHEMA %s TO %s;" % (schema, test_user), True)
+
+# 2) Run test SQLs
+info_(this, "> Running test scripts for:", verbose)
+
+caseset = (set([test.strip() for test in testcase.split(',')])
+   if testcase != "" else set())
+
+modset = {}
+for case in caseset:
+if case.find('/') > -1:
+[mod, algo] = case.split('/')
+if mod not in modset:
+modset[mod] = []
+if algo not in modset[mod]:
+modset[mod].append(algo)
+else:
+modset[case] = []
+
+# Loop through all modules
+for moduleinfo in portspecs['modules']:
+
+# Get module name
+module = moduleinfo['name']
+
+# Skip if doesn't meet specified modules
+if modset is not None and len(modset) > 0 and module not in modset:
+continue
+# JIRA: MADLIB-1078 fix
+# Skip pmml during install-check (when run without the -t option).
+# We can still run install-check on pmml with '-t' option.
+if not modset and module in ['pmml']:
+continue
+info_(this, "> - %s" % module, verbose)
+
+# Make a temp dir for this module (if doesn't exist)
+cur_tmpdir = tmpdir + '/' + module + '/test'  # tmpdir is a global variable
+_make_dir(cur_tmpdir)
+
+# Find the Python module dir (platform specific or generic)
+if os.path.isdir(maddir + "/ports/" + portid + "/" + dbver + "/modules/" + module):
+maddir_mod_py = maddir + "/ports/" + portid + "/" + dbver + "/modules"
+else:
+maddir_mod_py = maddir + "/modules"
+
+# Find the SQL module dir (platform specific or generic)
+if os.path.isdir(maddir + "/ports/" + portid + "/modules/" + module):
+maddir_mod_sql = maddir + "/ports/" + portid + "/modules"
+else:
+maddir_mod_sql = maddir + "/modules"
+
+# Prepare test schema
+test_schema = "madlib_installcheck_%s" % (module)
+_internal_run_query("DROP SCHEMA IF EXISTS %s CASCADE; CREATE SCHEMA %s;" %
+(test_schema, test_schema), True)
+_internal_run_query("GRANT ALL ON SCHEMA %s TO %s;" %
+(test_schema, test_user), True)
+
+# Switch to test user and prepare the search_path
+pre_sql = '-- Switch to test user:\n' \
+  'SET ROLE %s;\n' \
+  '-- Set SEARCH_PATH for install-check:\n' \
+  'SET search_path=%s,%s;\n' \
+  % (test_user, test_schema, schema)
+
+# Loop through all test SQL files for this module
+sql_files = maddir_mod_sql + '/' + module + '/test/*.sql_in'
+for sqlfile in sorted(glob.glob(sql_files), reverse=True):
+algoname = os.path.basename(sqlfile).split('.')[0]
+# run only algo specified
+if (module in modset and modset[module] and
+algoname not in modset[module]):
+continue
+
+# Set file names
+tmpfile = cur_tmpdir + '/' + os.path.basename(sqlfile) + '.tmp'
+   

[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...

2018-05-30 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/271#discussion_r191586806
  
--- Diff: src/madpack/madpack.py ---
@@ -131,10 +141,73 @@ def _get_relative_maddir(maddir, port):
 return maddir
 # 
--
 
+def _cleanup_comments_in_sqlfile(output_filename, upgrade):
--- End diff --

This function does more than just clean up the comments. It also creates a 
new file with the cleaned contents. Maybe rename the function to reflect 
this?




---


[GitHub] madlib pull request #268: DT: Don't use NULL value to get dep_var type

2018-05-25 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/268#discussion_r185970102
  
--- Diff: src/ports/postgres/modules/utilities/validate_args.py_in ---
@@ -368,8 +375,10 @@ def get_expr_type(expr, tbl):
 SELECT pg_typeof({0}) AS type
 FROM {1}
 LIMIT 1
-""".format(expr, tbl))[0]['type']
-return expr_type.lower()
+""".format(expr, tbl))
+if not expr_type:
+plpy.error("Table {0} does not contain any valid tuples".format(tbl))
--- End diff --

Maybe the error message should also mention that we failed to get the 
expression type of column foo in table bar. What do you think?
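For example, the message could carry both the expression and the table, so the failure is self-describing. Since `plpy` only exists inside the database, this sketch just builds the message string (the function name is hypothetical):

```python
def empty_table_error_message(expr, tbl):
    # Mention what we were trying to do, not only the table state.
    return ("Failed to get the expression type of '{0}' in table '{1}': "
            "table does not contain any valid tuples".format(expr, tbl))
```

The real code would pass this string to `plpy.error(...)`.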


---


[GitHub] madlib pull request #272: MLP: Add momentum and nesterov to gradient updates...

2018-05-24 Thread kaknikhil
GitHub user kaknikhil opened a pull request:

https://github.com/apache/madlib/pull/272

MLP: Add momentum and nesterov to gradient updates.

JIRA: MADLIB-1210

We refactored the minibatch code to separate out the momentum and model
update functions. We initially used the same function to compute the loss
and gradient for both IGD and minibatch, but the overhead of creating and
updating the total_gradient_per_layer variable made IGD slower. So we
decided not to share that code; we now call the model and momentum update
functions separately for IGD and minibatch.

Co-authored-by: Rahul Iyer<ri...@apache.org>
Co-authored-by: Jingyi Mei <j...@pivotal.io>
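For context, a minimal sketch of the classic momentum and Nesterov update rules this PR adds, shown on a single scalar parameter and independent of the MLP code itself:

```python
def momentum_step(w, velocity, grad_fn, lr=0.1, mu=0.9, nesterov=False):
    """One gradient step with momentum; Nesterov evaluates the gradient
    at the look-ahead point w + mu * velocity."""
    point = w + mu * velocity if nesterov else w
    velocity = mu * velocity - lr * grad_fn(point)
    return w + velocity, velocity

# Minimizing f(w) = w**2 (gradient 2*w) starting from w = 1.0:
w, v = 1.0, 0.0
for _ in range(50):
    w, v = momentum_step(w, v, lambda x: 2.0 * x)
```

In the MLP code these updates are applied per layer to weight matrices rather than to a scalar, but the velocity bookkeeping is the same idea.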

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib feature/mlp_momentum

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/272.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #272


commit 176e197f48732443ce658c5d02cefc8c45e7ff52
Author: Rahul Iyer <riyer@...>
Date:   2018-05-02T12:25:48Z

MLP: Add momentum and nesterov to gradient updates.

JIRA: MADLIB-1210

We refactored the minibatch code to separate out the momentum and model
update functions. We initially were using the same function to get the
loss and gradient for both igd and minibatch but the overhead of
creating and updating the total_gradient_per_layer variable made igd
slower. So we decided not to use the same code and are now calling the
model and momentum update functions for both igd and minibatch

Co-authored-by: Rahul Iyer<ri...@apache.org>
Co-authored-by: Jingyi Mei <j...@pivotal.io>




---



[GitHub] madlib pull request #260: minibatch preprocessor improvements

2018-04-11 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/260#discussion_r180851151
  
--- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
@@ -387,6 +397,7 @@ class MiniBatchStandardizer:
 ) as {ind_colname}
 FROM {source_table}
 """.format(
+standardized_table = self.standardized_table,
--- End diff --

Yes, #259 also made a few changes related to this. Will update all the 
remaining ones.


---


[GitHub] madlib pull request #260: minibatch preprocessor improvements

2018-04-10 Thread kaknikhil
GitHub user kaknikhil opened a pull request:

https://github.com/apache/madlib/pull/260

minibatch preprocessor improvements

This PR makes two improvements to the preprocessor code

1. Check for all character types for dependent col
2. Create temp table for standardization.

See the commit for more details

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib feature/minibatch-preprocessing-improvements

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/260.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #260


commit d5e996a1eb3ea1d28151b48e435f40a3a764aa51
Author: Nikhil Kak <nkak@...>
Date:   2018-04-06T18:35:16Z

Utilities: Add functions for postgres character/boolean type comparison.

This commit adds two functions to check if a given type matches one of the 
predefined postgres character or boolean types.

commit 0f6ca99f4de32f1a235fed612d3b74bf822ef3f9
Author: Nikhil Kak <nkak@...>
Date:   2018-04-06T18:42:41Z

MiniBatch Preprocessor: Check for all character types for dependent col

This commit enables support for dependent column type to be any of the 
postgres character
types instead of just `text`.

commit e3462580b7d43589c8a52244029e056ce182a529
Author: Nikhil Kak <nkak@...>
Date:   2018-04-06T20:55:46Z

Minibatch Preprocessor: Create temp table for standardization.

We did a few experiments and the results proved that creating a temp table 
for standardization is faster than using a subquery.
This commit now creates a temp table for the standardization.
Before this commit, we were calling the `utils_normalize_data` function 
inside the main query but now we create a temp table from the
output of `utils_normalize_data` and use the table in the main query.




---


[GitHub] madlib pull request #253: MLP: Add install check tests for minibatch with gr...

2018-03-29 Thread kaknikhil
Github user kaknikhil closed the pull request at:

https://github.com/apache/madlib/pull/253


---


[GitHub] madlib issue #253: MLP: Add install check tests for minibatch with grouping

2018-03-29 Thread kaknikhil
Github user kaknikhil commented on the issue:

https://github.com/apache/madlib/pull/253
  
Closed by ab7166ff4fc55311ec29bb8b54d17becd9bb1750


---


[GitHub] madlib pull request #250: MLP: Allow one-hot encoded dependent var for class...

2018-03-29 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/250#discussion_r177912997
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -667,7 +678,8 @@ def _validate_dependent_var(source_table, 
dependent_varname,
 if is_classification:
 # Currently, classification doesn't accept an
--- End diff --

This comment is not relevant anymore. 


---


[GitHub] madlib pull request #250: MLP: Allow one-hot encoded dependent var for class...

2018-03-29 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/250#discussion_r177912480
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -667,7 +678,8 @@ def _validate_dependent_var(source_table, 
dependent_varname,
 if is_classification:
 # Currently, classification doesn't accept an
 # array for dep type in IGD
-_assert("[]" not in expr_type and expr_type in classification_types,
+_assert(("[]" in expr_type and expr_type[:-2] in int_types) \
--- End diff --

Why are we checking for only int types and not all numeric types?


---


[GitHub] madlib pull request #250: MLP: Allow one-hot encoded dependent var for class...

2018-03-29 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/250#discussion_r177913869
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -856,8 +868,16 @@ def mlp_predict(schema_madlib, model_table, 
data_table, id_col_name,
 activation = _get_activation_index(summary['activation'])
 layer_sizes = PY2SQL(
 summary['layer_sizes'], array_type="DOUBLE PRECISION")
-is_classification = int(summary["is_classification"])
 is_response = int(pred_type == 'response')
+is_classification = int(summary["is_classification"])
+classes = summary['classes']
+# Set a flag to indicate that it is a classification model, with an array
+# as the dependent var. The only scenario where classification allows for
+# an array dep var is when the user has provided a one-hot encoded dep var
+# during training, and mlp_classification does not one-hot encode
+# (and hence classes column in model's summary table is NULL).
+array_dep_var_for_classification = int(is_classification and not classes)
--- End diff --

Can we change the name of `array_dep_var_for_classification` to something 
like `is_dep_var_an_array`, so that it's clear that it's a flag?


---


[GitHub] madlib pull request #253: MLP: Add install check tests for minibatch with gr...

2018-03-28 Thread kaknikhil
GitHub user kaknikhil opened a pull request:

https://github.com/apache/madlib/pull/253

MLP: Add install check tests for minibatch with grouping

This PR adds install check tests for MLP minibatch with grouping. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib feature/mlp-minibatch-grouping

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/253.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #253


commit fe0bc93d83fe295658d689f4957eb9d12a513c23
Author: Nikhil Kak <nkak@...>
Date:   2018-03-26T18:55:25Z

MLP: Add install check tests for minibatch with grouping




---


[GitHub] madlib issue #249: RF: Use NULL::integer[] when no continuous features

2018-03-27 Thread kaknikhil
Github user kaknikhil commented on the issue:

https://github.com/apache/madlib/pull/249
  
The changes look good. +1 for adding an install check test.

I noticed that `con_features` is populated by `decision_tree.py:_classify_features`, which does something like `con_types = ['real', 'float8', 'double precision', 'numeric']`. Would it be better to call the utilities function `is_psql_numeric_type`?


---


[GitHub] madlib issue #247: SVM: Revert minibatch-related work

2018-03-22 Thread kaknikhil
Github user kaknikhil commented on the issue:

https://github.com/apache/madlib/pull/247
  
@iyerr3 
Thanks for the detailed explanation. I agree, we should avoid a redundant 
commit and take care of this in a future commit. 


---


[GitHub] madlib issue #247: SVM: Revert minibatch-related work

2018-03-22 Thread kaknikhil
Github user kaknikhil commented on the issue:

https://github.com/apache/madlib/pull/247
  
@iyerr3 
I compared this PR with the original commit and it looks good. Just a couple of minor questions:
1. This change to model.hpp was not reverted. It looks like it's not related to minibatch, but I just wanted to confirm it with you:
```
-for (k = 1; k <= N; k ++) {
-u.push_back(Eigen::Map<Matrix>(
-const_cast<double*>(data + sizeOfU),
-n[k-1] + 1, n[k]));
-sizeOfU += (n[k-1] + 1) * (n[k]);
+for (k = 0; k < N; k ++) {
+// u.push_back(Eigen::Map<Matrix>(
+// const_cast<double*>(data + sizeOfU),
+// n[k] + 1, n[k+1]));
+u.push_back(MutableMappedMatrix());
+u[k].rebind(const_cast<double*>(data + sizeOfU), n[k] + 1, n[k+1]);
+sizeOfU += (n[k] + 1) * (n[k+1]);
```
2. A function named `initialize` was added to model.hpp, but I didn't see it getting used anywhere. Should this be reverted as well?

I also cherry-picked this revert commit on top of the MLP minibatch branch `mlp-minibatch-with-preprocessed-data-rebased` and successfully ran install check for SVM and MLP with Greenplum.


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-21 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r176262391
  
--- Diff: src/ports/postgres/modules/convex/test/mlp.sql_in ---
@@ -340,6 +181,51 @@ INSERT INTO iris_data VALUES
 (149,ARRAY[6.2,3.4,5.4,2.3],'Iris-virginica',3,2),
 (150,ARRAY[5.9,3.0,5.1,1.8],'Iris-virginica',3,2);
 
+-- NOTE that the batch specific tables were created using:
+-- madlib.minibatch_preprocessor(), with the regular source tables used in
+-- this file.
+
+-- Create preprocessed data that can be used with minibatch MLP:
+DROP TABLE IF EXISTS iris_data_batch, iris_data_batch_summary, 
iris_data_batch_standardization;
+CREATE TABLE iris_data_batch(
+__id__ integer,
+dependent_varname double precision[],
+independent_varname double precision[]
+);
+COPY iris_data_batch (__id__, dependent_varname, independent_varname) FROM STDIN NULL '?' DELIMITER '|';
+0 | {{0,1,0},{0,1,0},{0,0,1},...} | {{0.828881825720994,-0.314980522532101,0.363710790466334,0.159758615207397},...}
+1 | {{0,1,0},{1,0,0},{0,1,0},...} | {{-0.0623015751321059,-0.0626381721036282,0.304423829245331,0.0264688829947554},...}
[the remaining standardized data rows and the rest of this message were truncated in the archive]

---

[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175888098
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -590,51 +664,103 @@ def _validate_warm_start(output_table, summary_table, standardization_table,
  output_table + ". Invalid number of coefficients in model.")
 return coeff
 
+def _validate_dependent_var(source_table, dependent_varname,
+is_classification, is_minibatch_enabled):
+expr_type = get_expr_type(dependent_varname, source_table)
+int_types = ['integer', 'smallint', 'bigint']
+text_types = ['text', 'varchar', 'character varying', 'char', 'character']
+boolean_types = ['boolean']
+float_types = ['double precision', 'real']
+classification_types = int_types + boolean_types + text_types
+regression_types = int_types + float_types
+validate_type = classification_types if is_classification else regression_types
+
+if is_minibatch_enabled:
+# With pre-processed data, dep type is always an array
+_assert("[]" in expr_type,
+"Dependent variable column should refer to an array.")
+# The dependent variable is always a double precision array in
+# preprocessed data (so use regression_types)
+# strip out '[]' from expr_type
+_assert(expr_type[:-2] in regression_types,
--- End diff --

There are other numeric types like `decimal`, `numeric`, etc., which makes me wonder whether we really need this assert. The same applies to the regression case for IGD at line 696.
And if we really do want to assert this, consider using the function `is_psql_numeric_type` in `utilities.py_in`.
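A minimal sketch of the generalized check being suggested (the helper name and the type list here are illustrative, not the actual `utilities.py_in` implementation):

```python
# Hypothetical helper in the spirit of utilities.py_in's is_psql_numeric_type:
# check a PostgreSQL type name against the whole numeric family instead of a
# hand-maintained subset such as ['double precision', 'real'].
NUMERIC_TYPES = ('smallint', 'integer', 'bigint', 'decimal', 'numeric',
                 'real', 'float8', 'double precision', 'serial', 'bigserial')

def is_numeric_psql_type(expr_type):
    """Return True if expr_type, optionally an array type like
    'double precision[]', names a PostgreSQL numeric type."""
    base_type = expr_type[:-2] if expr_type.endswith('[]') else expr_type
    return base_type.strip().lower() in NUMERIC_TYPES

print(is_numeric_psql_type('double precision[]'))  # True
print(is_numeric_psql_type('text'))                # False
```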


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175893520
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -1457,3 +1660,85 @@ def mlp_predict_help(schema_madlib, message):
 return """
No such option. Use "SELECT {schema_madlib}.mlp_predict()" for help.
 """.format(**args)
+
+
+def check_if_minibatch_enabled(source_table, independent_varname):
+"""
+Function to validate if the source_table is converted to a format that
+can be used for mini-batching. It checks for the dimensionalities of
+the independent variable to determine the same.
+"""
+query = """
+SELECT array_upper({0}, 1) AS n_x,
+   array_upper({0}, 2) AS n_y,
+   array_upper({0}, 3) AS n_z
+FROM {1}
+LIMIT 1
+""".format(independent_varname, source_table)
+result = plpy.execute(query)
+
+if not result:
+plpy.error("MLP: Input table could be empty.")
+
+has_x_dim, has_y_dim, has_z_dim = [bool(result[0][i])
+   for i in ('n_x', 'n_y', 'n_z')]
+if not has_x_dim:
+plpy.error("MLP: {0} is empty.".format(independent_varname))
+
+# error out if >2d matrix
+if has_z_dim:
+plpy.error("MLP: Input table is not in the right format.")
+return has_y_dim
+
+
+class MLPPreProcessor:
+"""
+This class consumes and validates the pre-processed source table used for
+MLP mini-batch. This also populates values from the pre-processed summary
+table which is used by MLP mini-batch
+
+"""
+# summary table columns names
+DEPENDENT_VARNAME = "dependent_varname"
+INDEPENDENT_VARNAME = "independent_varname"
+GROUPING_COL = "grouping_cols"
+CLASS_VALUES = "class_values"
+MODEL_TYPE_CLASSIFICATION = "classification"
+MODEL_TYPE_REGRESSION = "regression"
+
+def __init__(self, source_table):
+self.source_table = source_table
+self.preprocessed_summary_dict = None
+self.summary_table = add_postfix(self.source_table, "_summary")
+self.std_table = add_postfix(self.source_table, "_standardization")
+
+self._validate_and_set_preprocessed_summary()
+
+def _validate_and_set_preprocessed_summary(self):
+input_tbl_valid(self.source_table, 'MLP')
+
+if not table_exists(self.summary_table) or not table_exists(self.std_table):
+plpy.error("Tables {0} and/or {1} do not exist. These tables are"
+           " needed for using minibatch during training.".format(
+               self.summary_table, self.std_table))
+
+query = "SELECT * FROM {0}".format(self.summary_table)
+summary_table_columns = plpy.execute(query)
+if not summary_table_columns or len(summary_table_columns) == 0:
+plpy.error("No columns in table {0}.".format(self.summary_table))
+else:
+summary_table_columns = summary_table_columns[0]
+
+required_columns = (self.DEPENDENT_VARNAME, self.INDEPENDENT_VARNAME,
--- End diff --

We also use the `buffer_size` and `source_table` columns from the summary table. Do we need to validate those as well, or are these three enough?
If we decide to assert all the columns that we consume, we will have to keep this assert in sync with how we use the summary dict, which is easy to forget. I don't have a better solution, but I just wanted to mention it.


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175891761
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -72,107 +73,127 @@ def mlp(schema_madlib, source_table, output_table, independent_varname,
 """
 warm_start = bool(warm_start)
 optimizer_params = _get_optimizer_params(optimizer_param_str or "")
+
+tolerance = optimizer_params["tolerance"]
+n_iterations = optimizer_params["n_iterations"]
+step_size_init = optimizer_params["learning_rate_init"]
+iterations_per_step = optimizer_params["iterations_per_step"]
+power = optimizer_params["power"]
+gamma = optimizer_params["gamma"]
+step_size = step_size_init
+n_tries = optimizer_params["n_tries"]
+# lambda is a reserved word in python
+lmbda = optimizer_params["lambda"]
+batch_size = optimizer_params['batch_size']
+n_epochs = optimizer_params['n_epochs']
+
 summary_table = add_postfix(output_table, "_summary")
 standardization_table = add_postfix(output_table, "_standardization")
-weights = '1' if not weights or not weights.strip() else weights.strip()
 hidden_layer_sizes = hidden_layer_sizes or []
 
-grouping_col = grouping_col or ""
-activation = _get_activation_function_name(activation)
-learning_rate_policy = _get_learning_rate_policy_name(
-optimizer_params["learning_rate_policy"])
-activation_index = _get_activation_index(activation)
-
+# Note that we don't support weights with mini batching yet, so validate
+# this based on is_minibatch_enabled.
+weights = '1' if not weights or not weights.strip() else weights.strip()
 _validate_args(source_table, output_table, summary_table,
standardization_table, independent_varname,
dependent_varname, hidden_layer_sizes, optimizer_params,
-   is_classification, weights, warm_start, activation,
-   grouping_col)
+   warm_start, activation, grouping_col)
+is_minibatch_enabled = check_if_minibatch_enabled(source_table, independent_varname)
+_validate_params_based_on_minibatch(source_table, independent_varname,
+dependent_varname, weights,
+is_classification,
+is_minibatch_enabled)
+activation = _get_activation_function_name(activation)
+learning_rate_policy = _get_learning_rate_policy_name(
+optimizer_params["learning_rate_policy"])
+activation_index = _get_activation_index(activation)
 
 reserved_cols = ['coeff', 'loss', 'n_iterations']
+grouping_col = grouping_col or ""
 grouping_str, grouping_col = get_grouping_col_str(schema_madlib, 'MLP',
   reserved_cols,
   source_table,
   grouping_col)
-current_iteration = 1
-prev_state = None
-tolerance = optimizer_params["tolerance"]
-n_iterations = optimizer_params["n_iterations"]
-step_size_init = optimizer_params["learning_rate_init"]
-iterations_per_step = optimizer_params["iterations_per_step"]
-power = optimizer_params["power"]
-gamma = optimizer_params["gamma"]
-step_size = step_size_init
-n_tries = optimizer_params["n_tries"]
-# lambda is a reserved word in python
-lmbda = optimizer_params["lambda"]
-iterations_per_step = optimizer_params["iterations_per_step"]
-num_input_nodes = array_col_dimension(source_table,
-  independent_varname)
-num_output_nodes = 0
+dependent_varname_backup = dependent_varname
 classes = []
-dependent_type = get_expr_type(dependent_varname, source_table)
-original_dependent_varname = dependent_varname
-
-x_mean_table = unique_string(desp='x_mean_table')
-dimension, n_tuples = _tbl_dimension_rownum(schema_madlib, source_table,
-independent_varname)
-
-tbl_data_scaled = unique_string(desp="tbl_data_scaled")
-col_ind_var_norm_new = unique_string(desp="ind_var_norm")
-col_dep_var_norm_new = unique_string(desp="dep_var_norm")

[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175627412
  
--- Diff: src/modules/convex/mlp_igd.cpp ---
@@ -130,6 +145,90 @@ mlp_igd_transition::run(AnyType &args) {
 
 return state;
 }
+/**
+ * @brief Perform the multilayer perceptron minibatch transition step
+ *
+ * Called for each tuple.
+ */
+AnyType
mlp_minibatch_transition::run(AnyType &args) {
+// For the first tuple: args[0] is nothing more than a marker that
+// indicates that we should do some initial operations.
// For other tuples: args[0] holds the computation state until last tuple
MLPMiniBatchState<MutableArrayHandle<double> > state = args[0];
+
+// initilize the state if first tuple
+if (state.algo.numRows == 0) {
+if (!args[3].isNull()) {
MLPMiniBatchState<ArrayHandle<double> > previousState = args[3];
+state.allocate(*this, previousState.task.numberOfStages,
+   previousState.task.numbersOfUnits);
+state = previousState;
+} else {
+// configuration parameters
ArrayHandle<double> numbersOfUnits = args[4].getAs<ArrayHandle<double> >();
--- End diff --

Is it possible to reuse the code that gets the values from the `args` parameter? I noticed that the IGD transition function `mlp_igd_transition` has the exact same code.


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175872591
  
--- Diff: src/modules/convex/task/mlp.hpp ---
@@ -111,6 +117,57 @@ class MLP {
template <class Model, class Tuple>
 double MLP<Model, Tuple>::lambda = 0;
 
template <class Model, class Tuple>
double
MLP<Model, Tuple>::getLossAndUpdateModel(
model_type &model,
const Matrix &x_batch,
const Matrix &y_true_batch,
const double &stepsize) {
+
+uint16_t N = model.u.size(); // assuming nu. of layers >= 1
+size_t n = x_batch.rows();
--- End diff --

Is there a reason we chose `N` and `n` as variable names? Can we use more descriptive names?




---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175923655
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -33,11 +34,12 @@ from convex.utils_regularization import __utils_normalize_data_grouping
 
 from utilities.in_mem_group_control import GroupIterationController
 from utilities.utilities import _array_to_string
+from utilities.utilities import add_postfix
+from utilities.utilities import py_list_to_sql_string as PY2SQL
+from utilities.utilities import extract_keyvalue_params
 from utilities.utilities import _assert
 from utilities.utilities import _assert_equal
 from utilities.utilities import _string_to_array_with_quotes
-from utilities.utilities import add_postfix
-from utilities.utilities import extract_keyvalue_params
 from utilities.utilities import py_list_to_sql_string
--- End diff --

We don't need this import anymore.


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175894372
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -1457,3 +1660,85 @@ def mlp_predict_help(schema_madlib, message):
 return """
No such option. Use "SELECT {schema_madlib}.mlp_predict()" for help.
 """.format(**args)
+
+
+def check_if_minibatch_enabled(source_table, independent_varname):
+"""
+Function to validate if the source_table is converted to a format that
+can be used for mini-batching. It checks for the dimensionalities of
+the independent variable to determine the same.
+"""
+query = """
+SELECT array_upper({0}, 1) AS n_x,
+   array_upper({0}, 2) AS n_y,
+   array_upper({0}, 3) AS n_z
+FROM {1}
+LIMIT 1
+""".format(independent_varname, source_table)
+result = plpy.execute(query)
+
+if not result:
+plpy.error("MLP: Input table could be empty.")
+
+has_x_dim, has_y_dim, has_z_dim = [bool(result[0][i])
+   for i in ('n_x', 'n_y', 'n_z')]
+if not has_x_dim:
+plpy.error("MLP: {0} is empty.".format(independent_varname))
+
+# error out if >2d matrix
+if has_z_dim:
+plpy.error("MLP: Input table is not in the right format.")
+return has_y_dim
+
+
+class MLPPreProcessor:
+"""
+This class consumes and validates the pre-processed source table used for
+MLP mini-batch. This also populates values from the pre-processed summary
+table which is used by MLP mini-batch
+
+"""
+# summary table columns names
+DEPENDENT_VARNAME = "dependent_varname"
+INDEPENDENT_VARNAME = "independent_varname"
+GROUPING_COL = "grouping_cols"
+CLASS_VALUES = "class_values"
+MODEL_TYPE_CLASSIFICATION = "classification"
+MODEL_TYPE_REGRESSION = "regression"
+
+def __init__(self, source_table):
+self.source_table = source_table
+self.preprocessed_summary_dict = None
+self.summary_table = add_postfix(self.source_table, "_summary")
+self.std_table = add_postfix(self.source_table, "_standardization")
+
+self._validate_and_set_preprocessed_summary()
+
+def _validate_and_set_preprocessed_summary(self):
+input_tbl_valid(self.source_table, 'MLP')
+
+if not table_exists(self.summary_table) or not table_exists(self.std_table):
+plpy.error("Tables {0} and/or {1} do not exist. These tables are"
+           " needed for using minibatch during training.".format(
+               self.summary_table, self.std_table))
+
+query = "SELECT * FROM {0}".format(self.summary_table)
+summary_table_columns = plpy.execute(query)
+if not summary_table_columns or len(summary_table_columns) == 0:
+plpy.error("No columns in table {0}.".format(self.summary_table))
+else:
+summary_table_columns = summary_table_columns[0]
+
+required_columns = (self.DEPENDENT_VARNAME, self.INDEPENDENT_VARNAME,
+self.CLASS_VALUES)
+if set(required_columns) <= set(summary_table_columns):
+self.preprocessed_summary_dict = summary_table_columns
+else:
+plpy.error("Expected columns ({0}, {1} and/or {2}) not present in"
--- End diff --

We can use `required_columns` to format the error message so that we don't have to repeat the column names. Something like:
```
plpy.error("One or more of the expected columns {0} not present in "
           "{1}".format(required_columns, self.summary_table))
```


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175923217
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -292,26 +329,33 @@ def mlp(schema_madlib, source_table, output_table, independent_varname,
# used, it will be an empty list if there was not grouping.
groups = [t[col_grp_key] for t in res if t[col_grp_key]]
 losses = [t['loss'] for t in res]
-loss = zip(groups, losses) if len(groups)==len(losses) \
-   else losses
-plpy.info("Iteration: " + str(it.iteration) + ", Loss: <" + \
-  ', '.join([str(l) for l in loss]) + ">")
+loss = zip(groups, losses) if groups else losses
+plpy.info("Iteration: {0}, Loss: <{1}>".
+  format(it.iteration, ', '.join(map(str, loss))))
 it.final()
 _update_temp_model_table(it_args, it.iteration, temp_output_table,
- first_try)
+ is_minibatch_enabled, first_try)
 first_try = False
-layer_sizes_str = py_list_to_sql_string(
-layer_sizes, array_type="integer")
-classes_str = py_list_to_sql_string(
-[strip_end_quotes(cl, "'") for cl in classes],
-array_type=dependent_type)
+layer_sizes_str = py_list_to_sql_string(layer_sizes,
+array_type="integer")
+
 _create_summary_table(locals())
-_create_standardization_table(standardization_table, x_mean_table,
-  warm_start)
+if is_minibatch_enabled:
+# We already have the mean and std in the input standardization table
+input_std_table = add_postfix(source_table, '_standardization')
+_create_standardization_table(standardization_table, input_std_table,
+  warm_start)
+else:
+_create_standardization_table(standardization_table, x_mean_table,
+  warm_start)
+# The original input table is the tab_data_scaled for mini batch.
--- End diff --

This comment should be moved to the `if` block.


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175889157
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -590,51 +664,103 @@ def _validate_warm_start(output_table, summary_table, standardization_table,
  output_table + ". Invalid number of coefficients in model.")
 return coeff
 
+def _validate_dependent_var(source_table, dependent_varname,
+is_classification, is_minibatch_enabled):
+expr_type = get_expr_type(dependent_varname, source_table)
+int_types = ['integer', 'smallint', 'bigint']
+text_types = ['text', 'varchar', 'character varying', 'char', 'character']
+boolean_types = ['boolean']
+float_types = ['double precision', 'real']
+classification_types = int_types + boolean_types + text_types
+regression_types = int_types + float_types
+validate_type = classification_types if is_classification else regression_types
--- End diff --

I think it's slightly cleaner if we don't use the `validate_type` variable but use the `classification_types` and `regression_types` variables directly.


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175921215
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -222,67 +243,83 @@ def mlp(schema_madlib, source_table, output_table, independent_varname,
 it_args.update({
 'group_by_clause': group_by_clause,
 'using_clause': using_clause,
-'grouping_str_comma': grouping_str_comma
+'grouping_str_comma': grouping_str_comma,
 })
 
 first_try = True
 temp_output_table = unique_string(desp='temp_output_table')
+
+layer_sizes = [num_input_nodes] + hidden_layer_sizes + [num_output_nodes]
+
 for _ in range(n_tries):
+prev_state = None
 if not warm_start:
 coeff = []
-for i in range(len(layer_sizes) - 1):
-fan_in = layer_sizes[i]
-fan_out = layer_sizes[i + 1]
+for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
 # Initalize according to Glorot and Bengio (2010)
 # See design doc for more info
 span = math.sqrt(6.0 / (fan_in + fan_out))
-dim = (layer_sizes[i] + 1) * layer_sizes[i + 1]
-rand = plpy.execute("""SELECT array_agg({span}*2*(random()-0.5))
-   AS random
-   FROM generate_series(0,{dim})
-""".format(span=span, dim=dim))[0]["random"]
+dim = (fan_in + 1) * fan_out
+rand = [span * (random() - 0.5) for _ in range(dim)]
--- End diff --

Why are we subtracting 0.5 from `random()`?
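For context on this question: `random()` is uniform on [0, 1), so `random() - 0.5` centers the draw at zero, and scaling by `2 * span` stretches it to the symmetric interval (-span, span). A standalone sketch of the initialization the diff's own comment names (Glorot and Bengio, 2010); the helper name is illustrative:

```python
import math
from random import random

def glorot_uniform(fan_in, fan_out):
    """Glorot & Bengio (2010) uniform initialization. random() is uniform
    on [0, 1); subtracting 0.5 centers it at zero, and the factor 2*span
    stretches the draw to the full symmetric interval (-span, span),
    with span = sqrt(6 / (fan_in + fan_out))."""
    span = math.sqrt(6.0 / (fan_in + fan_out))
    dim = (fan_in + 1) * fan_out  # +1 accounts for the bias unit
    return [span * 2.0 * (random() - 0.5) for _ in range(dim)]

coeff = glorot_uniform(4, 3)
span = math.sqrt(6.0 / (4 + 3))
assert len(coeff) == (4 + 1) * 3
assert all(-span <= w <= span for w in coeff)
```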


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175628144
  
--- Diff: src/modules/convex/mlp_igd.cpp ---
@@ -130,6 +145,90 @@ mlp_igd_transition::run(AnyType &args) {
 
 return state;
 }
+/**
+ * @brief Perform the multilayer perceptron minibatch transition step
+ *
+ * Called for each tuple.
+ */
+AnyType
+mlp_minibatch_transition::run(AnyType &args) {
+// For the first tuple: args[0] is nothing more than a marker that
+// indicates that we should do some initial operations.
+// For other tuples: args[0] holds the computation state until last tuple
+MLPMiniBatchState<MutableArrayHandle<double> > state = args[0];
+
+// initilize the state if first tuple
+if (state.algo.numRows == 0) {
+if (!args[3].isNull()) {
+MLPMiniBatchState<ArrayHandle<double> > previousState = args[3];
+state.allocate(*this, previousState.task.numberOfStages,
+   previousState.task.numbersOfUnits);
+state = previousState;
+} else {
+// configuration parameters
+ArrayHandle<double> numbersOfUnits = args[4].getAs<ArrayHandle<double> >();
+int numberOfStages = numbersOfUnits.size() - 1;
+
+double stepsize = args[5].getAs<double>();
+
+state.allocate(*this, numberOfStages,
+   reinterpret_cast<const double *>(numbersOfUnits.ptr()));
+state.task.stepsize = stepsize;
+const int activation = args[6].getAs<int>();
+const int is_classification = args[7].getAs<int>();
+// args[8] is for weighting the input row, which is populated later.
+const bool warm_start = args[9].getAs<bool>();
+const double lambda = args[11].getAs<double>();
+state.algo.batchSize = args[12].getAs<int>();
+state.algo.nEpochs = args[13].getAs<int>();
+state.task.lambda = lambda;
+MLPTask::lambda = lambda;
+
+/* FIXME: The state is set back to zero for second row onwards if
+   initialized as in IGD. The following avoids that, but there is
+   some failure with debug build that must be fixed.
+*/
+state.task.model.is_classification =
+static_cast<double>(is_classification);
+state.task.model.activation = static_cast<double>(activation);
+MappedColumnVector initial_coeff = args[10].getAs<MappedColumnVector>();
+// copy initial_coeff into the model
+Index fan_in, fan_out, layer_start = 0;
+for (size_t k = 0; k < numberOfStages; ++k){
+fan_in = numbersOfUnits[k];
+fan_out = numbersOfUnits[k+1];
+state.task.model.u[k] << initial_coeff.segment(layer_start, (fan_in+1)*fan_out);
+layer_start = (fan_in + 1) * fan_out;
+}
+}
+// resetting in either case
+state.reset();
+}
+
+// meta data
+const uint16_t N = state.task.numberOfStages;
+const double *n = state.task.numbersOfUnits;
+
+// tuple
+Matrix indVar;
+Matrix depVar;
+try {
--- End diff --

Why do we expect the following 2 lines to fail? Maybe add a comment explaining the reason.


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175877947
  
--- Diff: src/modules/convex/task/mlp.hpp ---
@@ -111,6 +117,57 @@ class MLP {
template <class Model, class Tuple>
 double MLP<Model, Tuple>::lambda = 0;
 
+template <class Model, class Tuple>
+double
+MLP<Model, Tuple>::getLossAndUpdateModel(
+model_type &model,
+const Matrix &x_batch,
+const Matrix &y_true_batch,
+const double &stepsize) {
+
+uint16_t N = model.u.size(); // assuming nu. of layers >= 1
+size_t n = x_batch.rows();
+size_t i, k;
+double total_loss = 0.;
+
+// gradient added over the batch
+std::vector<Matrix> total_gradient_per_layer(N);
+for (k=0; k < N; ++k)
+total_gradient_per_layer[k] = Matrix::Zero(model.u[k].rows(),
+   model.u[k].cols());
+
+for (i=0; i < n; i++){
+ColumnVector x = x_batch.row(i);
+ColumnVector y_true = y_true_batch.row(i);
+
+std::vector<ColumnVector> net, o, delta;
+feedForward(model, x, net, o);
+backPropogate(y_true, o.back(), net, model, delta);
+
+for (k=0; k < N; k++){
+total_gradient_per_layer[k] += o[k] * delta[k].transpose();
+}
+
+// loss computation
+ColumnVector y_estimated = o.back();
+if(model.is_classification){
+double clip = 1.e-10;
+y_estimated = y_estimated.cwiseMax(clip).cwiseMin(1.-clip);
--- End diff --

Just curious, why do we need to recalculate `y_estimated`?
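For reference, this kind of clamp is the standard guard before a log-loss computation: a predicted probability of exactly 0 or 1 would send `log()` to -inf. A sketch of the same idea (a hypothetical helper, not the MADlib code):

```python
import math

def clipped_cross_entropy(y_true, y_est, clip=1.e-10):
    """Cross-entropy loss with predictions clamped to [clip, 1 - clip],
    mirroring cwiseMax(clip).cwiseMin(1.-clip) above: without the clamp,
    a prediction of exactly 0 or 1 makes log() diverge and the loss
    becomes inf/NaN."""
    total = 0.0
    for t, e in zip(y_true, y_est):
        e = min(max(e, clip), 1.0 - clip)  # the clamp under discussion
        total -= t * math.log(e) + (1.0 - t) * math.log(1.0 - e)
    return total

# Predictions at exactly 0.0 and 1.0 would make the loss infinite
# without the clamp:
loss = clipped_cross_entropy([0, 1, 0], [0.0, 1.0, 0.2])
print(math.isfinite(loss))  # True
```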


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175929883
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -292,26 +329,33 @@ def mlp(schema_madlib, source_table, output_table, 
independent_varname,
 # used, it will be an empty list if there was not 
grouping.
 groups = [t[col_grp_key] for t in res if 
t[col_grp_key]]
 losses = [t['loss'] for t in res]
-loss = zip(groups, losses) if len(groups)==len(losses) 
\
-   else losses
-plpy.info("Iteration: " + str(it.iteration) + ", Loss: 
<" + \
-  ', '.join([str(l) for l in loss]) + ">")
+loss = zip(groups, losses) if groups else losses
+plpy.info("Iteration: {0}, Loss: <{1}>".
+  format(it.iteration, ', '.join(map(str, 
loss
 it.final()
 _update_temp_model_table(it_args, it.iteration, temp_output_table,
- first_try)
+ is_minibatch_enabled, first_try)
 first_try = False
-layer_sizes_str = py_list_to_sql_string(
-layer_sizes, array_type="integer")
-classes_str = py_list_to_sql_string(
-[strip_end_quotes(cl, "'") for cl in classes],
-array_type=dependent_type)
+layer_sizes_str = py_list_to_sql_string(layer_sizes,
+array_type="integer")
+
 _create_summary_table(locals())
-_create_standardization_table(standardization_table, x_mean_table,
-  warm_start)
+if is_minibatch_enabled:
+# We already have the mean and std in the input standardization 
table
+input_std_table = add_postfix(source_table, '_standardization')
+_create_standardization_table(standardization_table, 
input_std_table,
+  warm_start)
+else:
+_create_standardization_table(standardization_table, x_mean_table,
+  warm_start)
+# The original input table is the tab_data_scaled for mini batch.
+# Do NOT drop this, it will end up dropping the original data 
table.
+plpy.execute("DROP TABLE IF EXISTS {0}".format(tbl_data_scaled))
+plpy.execute("DROP TABLE IF EXISTS {0}".format(x_mean_table))
--- End diff --

Is there a test for this in install-check to assert that the input tables, 
including the summary and standardization tables, aren't dropped for minibatch?
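
If such a test is added, the assertion itself is simple. A hedged sketch of what an install-check helper could look like (pure Python; table names and the helper itself are hypothetical, not existing MADlib code):

```python
def assert_tables_preserved(existing_tables, required_tables):
    # After a minibatch fit, the caller's input table and its
    # _summary/_standardization companions must still exist.
    missing = [t for t in required_tables if t not in existing_tables]
    assert not missing, "minibatch run dropped tables: %s" % missing
```

An install-check script would populate `existing_tables` from the catalog (e.g. `pg_tables`) after calling `mlp()`.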


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175620624
  
--- Diff: src/modules/convex/algo/igd.hpp ---
@@ -90,20 +90,27 @@ IGD<State, ConstState, Task>::transition(state_type &state,
 
 for (int curr_epoch=0; curr_epoch < n_epochs; curr_epoch++) {
 double loss = 0.0;
-for (int curr_batch=0, curr_batch_row_index=0; curr_batch < 
n_batches;
- curr_batch++, curr_batch_row_index += batch_size) {
-   Matrix X_batch;
-   ColumnVector y_batch;
-   if (curr_batch == n_batches-1) {
-  // last batch
-  X_batch = 
tuple.indVar.bottomRows(n_rows-curr_batch_row_index);
-  y_batch = tuple.depVar.tail(n_rows-curr_batch_row_index);
-   } else {
-   X_batch = tuple.indVar.block(curr_batch_row_index, 0, 
batch_size, n_ind_cols);
-   y_batch = tuple.depVar.segment(curr_batch_row_index, 
batch_size);
-   }
-   loss += Task::getLossAndUpdateModel(
-   state.task.model, X_batch, y_batch, state.task.stepsize);
+int random_curr_batch[n_batches];
+for(int i=0; i<n_batches; i++) {
+random_curr_batch[i] = i;
+}
+int curr_batch_row_index = 0;
+std::random_shuffle(&random_curr_batch[0], &random_curr_batch[n_batches]);
--- End diff --

Does it add value to add a comment explaining why we need to do a random 
shuffle?
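
For what it's worth, the reason is the usual stochastic-gradient one: visiting the minibatches in a fresh random order each epoch decorrelates successive gradient updates from the fixed on-disk row order. A small Python sketch of the per-epoch shuffle (illustrative only, not the MADlib implementation):

```python
import random

def epoch_batch_orders(n_batches, n_epochs, seed=0):
    # One shuffled visiting order per epoch, like the random_shuffle of
    # random_curr_batch at the top of each epoch in the quoted C++ loop.
    rng = random.Random(seed)
    orders = []
    for _ in range(n_epochs):
        order = list(range(n_batches))
        rng.shuffle(order)
        orders.append(order)
    return orders
```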


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175871655
  
--- Diff: src/modules/convex/mlp_igd.cpp ---
@@ -170,6 +289,24 @@ mlp_igd_final::run(AnyType &args) {
 return state;
 }
 
+
+/**
+ * @brief Perform the multilayer perceptron final step
+ */
+AnyType
+mlp_minibatch_final::run(AnyType &args) {
+// We request a mutable object. Depending on the backend, this might 
perform
+// a deep copy.
+MLPMiniBatchState<MutableArrayHandle<double> > state = args[0];
+// Aggregates that haven't seen any data just return Null.
+if (state.algo.numRows == 0) { return Null(); }
+
+L2<MLPModelType>::lambda = state.task.lambda;
+state.algo.loss = state.algo.loss/static_cast<double>(state.algo.numRows);
+state.algo.loss += L2<MLPModelType>::loss(state.task.model);
+return state;
--- End diff --

I noticed that the minibatch `AlgoState` does not have an incremental model, 
unlike the IGD `AlgoState`. Do you think it makes sense to add a comment 
explaining this?


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175917822
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -222,67 +243,83 @@ def mlp(schema_madlib, source_table, output_table, 
independent_varname,
 it_args.update({
 'group_by_clause': group_by_clause,
 'using_clause': using_clause,
-'grouping_str_comma': grouping_str_comma
+'grouping_str_comma': grouping_str_comma,
 })
 
 first_try = True
 temp_output_table = unique_string(desp='temp_output_table')
+
+layer_sizes = [num_input_nodes] + hidden_layer_sizes + 
[num_output_nodes]
--- End diff --

We already did this at line 176. Is this intentional?


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175895832
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -72,107 +73,127 @@ def mlp(schema_madlib, source_table, output_table, 
independent_varname,
 """
 warm_start = bool(warm_start)
 optimizer_params = _get_optimizer_params(optimizer_param_str or "")
+
+tolerance = optimizer_params["tolerance"]
+n_iterations = optimizer_params["n_iterations"]
+step_size_init = optimizer_params["learning_rate_init"]
+iterations_per_step = optimizer_params["iterations_per_step"]
+power = optimizer_params["power"]
+gamma = optimizer_params["gamma"]
+step_size = step_size_init
+n_tries = optimizer_params["n_tries"]
+# lambda is a reserved word in python
+lmbda = optimizer_params["lambda"]
+batch_size = optimizer_params['batch_size']
+n_epochs = optimizer_params['n_epochs']
+
 summary_table = add_postfix(output_table, "_summary")
 standardization_table = add_postfix(output_table, "_standardization")
-weights = '1' if not weights or not weights.strip() else 
weights.strip()
 hidden_layer_sizes = hidden_layer_sizes or []
 
-grouping_col = grouping_col or ""
-activation = _get_activation_function_name(activation)
-learning_rate_policy = _get_learning_rate_policy_name(
-optimizer_params["learning_rate_policy"])
-activation_index = _get_activation_index(activation)
-
+# Note that we don't support weights with mini batching yet, so 
validate
+# this based on is_minibatch_enabled.
+weights = '1' if not weights or not weights.strip() else 
weights.strip()
 _validate_args(source_table, output_table, summary_table,
standardization_table, independent_varname,
dependent_varname, hidden_layer_sizes, optimizer_params,
-   is_classification, weights, warm_start, activation,
-   grouping_col)
+   warm_start, activation, grouping_col)
+is_minibatch_enabled = check_if_minibatch_enabled(source_table, 
independent_varname)
+_validate_params_based_on_minibatch(source_table, independent_varname,
+dependent_varname, weights,
+is_classification,
+is_minibatch_enabled)
+activation = _get_activation_function_name(activation)
+learning_rate_policy = _get_learning_rate_policy_name(
+optimizer_params["learning_rate_policy"])
+activation_index = _get_activation_index(activation)
 
 reserved_cols = ['coeff', 'loss', 'n_iterations']
+grouping_col = grouping_col or ""
 grouping_str, grouping_col = get_grouping_col_str(schema_madlib, 'MLP',
   reserved_cols,
   source_table,
   grouping_col)
-current_iteration = 1
-prev_state = None
-tolerance = optimizer_params["tolerance"]
-n_iterations = optimizer_params["n_iterations"]
-step_size_init = optimizer_params["learning_rate_init"]
-iterations_per_step = optimizer_params["iterations_per_step"]
-power = optimizer_params["power"]
-gamma = optimizer_params["gamma"]
-step_size = step_size_init
-n_tries = optimizer_params["n_tries"]
-# lambda is a reserved word in python
-lmbda = optimizer_params["lambda"]
-iterations_per_step = optimizer_params["iterations_per_step"]
-num_input_nodes = array_col_dimension(source_table,
-  independent_varname)
-num_output_nodes = 0
+dependent_varname_backup = dependent_varname
--- End diff --

Can we add a comment explaining why we need this backup variable?


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175873168
  
--- Diff: src/modules/convex/task/mlp.hpp ---
@@ -111,6 +117,57 @@ class MLP {
template <class Model, class Tuple>
 double MLP<Model, Tuple>::lambda = 0;
 
+template <class Model, class Tuple>
+double
+MLP<Model, Tuple>::getLossAndUpdateModel(
+model_type &model,
+const Matrix &x_batch,
+const Matrix &y_true_batch,
+const double &stepsize) {
+
+uint16_t N = model.u.size(); // assuming nu. of layers >= 1
+size_t n = x_batch.rows();
+size_t i, k;
+double total_loss = 0.;
+
+// gradient added over the batch
+std::vector<Matrix> total_gradient_per_layer(N);
+for (k=0; k < N; ++k)
+total_gradient_per_layer[k] = Matrix::Zero(model.u[k].rows(),
+   model.u[k].cols());
+
+for (i=0; i < n; i++){
+ColumnVector x = x_batch.row(i);
+ColumnVector y_true = y_true_batch.row(i);
+
+std::vector<ColumnVector> net, o, delta;
+feedForward(model, x, net, o);
--- End diff --

Can we use a more descriptive name for the variable `o`?




---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175626817
  
--- Diff: src/modules/convex/mlp_igd.cpp ---
@@ -130,6 +145,90 @@ mlp_igd_transition::run(AnyType &args) {
 
 return state;
 }
+/**
+ * @brief Perform the multilayer perceptron minibatch transition step
+ *
+ * Called for each tuple.
+ */
+AnyType
+mlp_minibatch_transition::run(AnyType &args) {
+// For the first tuple: args[0] is nothing more than a marker that
+// indicates that we should do some initial operations.
+// For other tuples: args[0] holds the computation state until last 
tuple
+MLPMiniBatchState<MutableArrayHandle<double> > state = args[0];
+
+// initilize the state if first tuple
+if (state.algo.numRows == 0) {
+if (!args[3].isNull()) {
+MLPMiniBatchState<ArrayHandle<double> > previousState = args[3];
--- End diff --

Can we create this variable outside the if check and then use it if it's 
not null? It looks cleaner and is easier to follow.


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175625333
  
--- Diff: src/modules/convex/algo/igd.hpp ---
@@ -90,20 +90,27 @@ IGD<State, ConstState, Task>::transition(state_type &state,
 
 for (int curr_epoch=0; curr_epoch < n_epochs; curr_epoch++) {
 double loss = 0.0;
-for (int curr_batch=0, curr_batch_row_index=0; curr_batch < 
n_batches;
- curr_batch++, curr_batch_row_index += batch_size) {
-   Matrix X_batch;
-   ColumnVector y_batch;
-   if (curr_batch == n_batches-1) {
-  // last batch
-  X_batch = 
tuple.indVar.bottomRows(n_rows-curr_batch_row_index);
-  y_batch = tuple.depVar.tail(n_rows-curr_batch_row_index);
-   } else {
-   X_batch = tuple.indVar.block(curr_batch_row_index, 0, 
batch_size, n_ind_cols);
-   y_batch = tuple.depVar.segment(curr_batch_row_index, 
batch_size);
-   }
-   loss += Task::getLossAndUpdateModel(
-   state.task.model, X_batch, y_batch, state.task.stepsize);
+int random_curr_batch[n_batches];
+for(int i=0; i<n_batches; i++) {
+random_curr_batch[i] = i;
+}
+int curr_batch_row_index = 0;
+std::random_shuffle(&random_curr_batch[0], &random_curr_batch[n_batches]);
+for (int i=0; i < n_batches; i++) {
+int curr_batch = random_curr_batch[i];
+int curr_batch_row_index = curr_batch * batch_size;
+Matrix X_batch;
+Matrix Y_batch;
+if (curr_batch == n_batches-1) {
+   // last batch
+   X_batch = 
tuple.indVar.bottomRows(n_rows-curr_batch_row_index);
+   Y_batch = 
tuple.depVar.bottomRows(n_rows-curr_batch_row_index);
+} else {
+X_batch = tuple.indVar.block(curr_batch_row_index, 0, 
batch_size, n_ind_cols);
+Y_batch = tuple.depVar.block(curr_batch_row_index, 0, 
batch_size, n_ind_cols);
--- End diff --

Y_batch should use `n_dep_cols` instead of `n_ind_cols`. 
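
The bug matters because `block()` takes explicit column counts: X and Y share only the row range of a batch, and each keeps its own width. A small Python sketch of correct batch slicing (lists of rows stand in for Eigen matrices; names are illustrative):

```python
def slice_batch(X, Y, batch_index, batch_size):
    # X and Y are lists of rows. A batch is a ROW range shared by both;
    # each matrix keeps its own column count, so Y's width (n_dep_cols)
    # must never be taken from X's width (n_ind_cols).
    start = batch_index * batch_size
    x_b = X[start:start + batch_size]
    y_b = Y[start:start + batch_size]
    assert all(len(r) == len(X[0]) for r in x_b)
    assert all(len(r) == len(Y[0]) for r in y_b)
    return x_b, y_b
```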


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175621915
  
--- Diff: src/modules/convex/algo/igd.hpp ---
@@ -90,20 +90,27 @@ IGD<State, ConstState, Task>::transition(state_type &state,
 
 for (int curr_epoch=0; curr_epoch < n_epochs; curr_epoch++) {
 double loss = 0.0;
-for (int curr_batch=0, curr_batch_row_index=0; curr_batch < 
n_batches;
- curr_batch++, curr_batch_row_index += batch_size) {
-   Matrix X_batch;
-   ColumnVector y_batch;
-   if (curr_batch == n_batches-1) {
-  // last batch
-  X_batch = 
tuple.indVar.bottomRows(n_rows-curr_batch_row_index);
-  y_batch = tuple.depVar.tail(n_rows-curr_batch_row_index);
-   } else {
-   X_batch = tuple.indVar.block(curr_batch_row_index, 0, 
batch_size, n_ind_cols);
-   y_batch = tuple.depVar.segment(curr_batch_row_index, 
batch_size);
-   }
-   loss += Task::getLossAndUpdateModel(
-   state.task.model, X_batch, y_batch, state.task.stepsize);
+int random_curr_batch[n_batches];
+for(int i=0; i<n_batches; i++) {
+random_curr_batch[i] = i;
+}
+int curr_batch_row_index = 0;
+std::random_shuffle(&random_curr_batch[0], &random_curr_batch[n_batches]);
--- End diff --

Do we need to set a seed before calling `random_shuffle`? 
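
Whether to seed is a reproducibility trade-off: a fixed seed makes the batch order repeatable run-to-run (useful for debugging and install-check), while leaving it unseeded gives a different order each run. A Python sketch of the difference (`std::srand` would play the seed role on the C++ side; names are illustrative):

```python
import random

def shuffled_batches(n_batches, seed=None):
    # With a seed, the shuffle is deterministic across runs; with
    # seed=None, Python seeds from the OS and the order varies.
    rng = random.Random(seed)
    order = list(range(n_batches))
    rng.shuffle(order)
    return order
```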


---


[GitHub] madlib pull request #240: MLP: Fix step size initialization based on learnin...

2018-03-16 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/240#discussion_r175224785
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -112,6 +112,7 @@ def mlp(schema_madlib, source_table, output_table, 
independent_varname,
 num_output_nodes = 0
 classes = []
 dependent_type = get_expr_type(dependent_varname, source_table)
+classes_str = None
--- End diff --

We don't really need this `classes_str` variable, right?


---


[GitHub] madlib pull request #230: Balanced sets final

2018-02-02 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/230#discussion_r165529067
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -0,0 +1,748 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file EXCEPT in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+m4_changequote(`')
+
+import math
+
+if __name__ != "__main__":
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import _assert
+from utilities.utilities import extract_keyvalue_params
+from utilities.utilities import unique_string
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import get_cols
+from utilities.validate_args import table_exists
+from utilities.validate_args import table_is_empty
+else:
+# Used only for Unit Testing
+# FIXME: repeating a function from utilities that is needed by the 
unit test.
+# This should be removed once a unittest framework in used for testing.
+import random
+import time
+
+def unique_string(desp='', **kwargs):
+"""
+Generate random remporary names for temp table and other names.
+It has a SQL interface so both SQL and Python functions can call 
it.
+"""
+r1 = random.randint(1, 1)
+r2 = int(time.time())
+r3 = int(time.time()) % random.randint(1, 1)
+u_string = "__madlib_temp_" + desp + str(r1) + "_" + str(r2) + "_" 
+ str(r3) + "__"
+return u_string
+# 
--
+
+UNIFORM = 'uniform'
+UNDERSAMPLE = 'undersample'
+OVERSAMPLE = 'oversample'
+NOSAMPLE = 'nosample'
+
+NEW_ID_COLUMN = '__madlib_id__'
+NULL_IDENTIFIER = '__madlib_null_id__'
+
+def _get_frequency_distribution(source_table, class_col):
+""" Returns a dict containing the number of rows associated with each 
class
+level. Each class level value is converted to a string using 
::text.
+"""
+query_result = plpy.execute("""
+SELECT {class_col}::text AS classes,
+   count(*) AS class_count
+FROM {source_table}
+GROUP BY {class_col}
+ """.format(**locals()))
+actual_level_counts = {}
+for each_row in query_result:
+level = each_row['classes']
+if level:
+level = level.strip()
+actual_level_counts[level] = each_row['class_count']
+return actual_level_counts
+
+
+def _validate_and_get_sampling_strategy(sampling_strategy_str, 
output_table_size,
+supported_strategies=None, default=UNIFORM):
+""" Returns the sampling strategy based on the class_sizes input param.
+@param sampling_strategy_str The sampling strategy specified by the
+ user (class_sizes param)
+@returns:
+Str. One of [UNIFORM, UNDERSAMPLE, OVERSAMPLE]. Default is 
UNIFORM.
+"""
+if not sampling_strategy_str:
+sampling_strategy_str = default
+else:
+if len(sampling_strategy_str) < 3:
+# Require at least 3 characters since UNIFORM and UNDERSAMPLE 
have
+# common prefix substring
+plpy.error("Sample: Invalid class_sizes parameter")
+
+if not supported_strategies:
+supported_strategies = [UNIFORM, UNDERSAMPLE, OVERSAMPLE]
+try:
+# allow user to specify a prefix substring of
+# supported strategies.
+sampling_strategy_str = next(x for x in supported_strateg

[GitHub] madlib pull request #230: Balanced sets final

2018-02-02 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/230#discussion_r165523942
  

[GitHub] madlib pull request #230: Balanced sets final

2018-02-02 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/230#discussion_r165737045
  

[GitHub] madlib pull request #230: Balanced sets final

2018-02-02 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/230#discussion_r165736819
  

[GitHub] madlib pull request #230: Balanced sets final

2018-02-02 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/230#discussion_r165526039
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---
@@ -0,0 +1,748 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+m4_changequote(`<!', `!>')
+
+import math
+
+if __name__ != "__main__":
+    import plpy
+    from utilities.control import MinWarning
+    from utilities.utilities import _assert
+    from utilities.utilities import extract_keyvalue_params
+    from utilities.utilities import unique_string
+    from utilities.validate_args import columns_exist_in_table
+    from utilities.validate_args import get_cols
+    from utilities.validate_args import table_exists
+    from utilities.validate_args import table_is_empty
+else:
+    # Used only for Unit Testing
+    # FIXME: repeating a function from utilities that is needed by the unit test.
+    # This should be removed once a unittest framework is used for testing.
+    import random
+    import time
+
+    def unique_string(desp='', **kwargs):
+        """
+        Generate random temporary names for temp tables and other objects.
+        It has a SQL interface so both SQL and Python functions can call it.
+        """
+        r1 = random.randint(1, 100000000)
+        r2 = int(time.time())
+        r3 = int(time.time()) % random.randint(1, 100000000)
+        u_string = "__madlib_temp_" + desp + str(r1) + "_" + str(r2) + "_" + str(r3) + "__"
+        return u_string
+# ----------------------------------------------------------------------
+
+UNIFORM = 'uniform'
+UNDERSAMPLE = 'undersample'
+OVERSAMPLE = 'oversample'
+NOSAMPLE = 'nosample'
+
+NEW_ID_COLUMN = '__madlib_id__'
+NULL_IDENTIFIER = '__madlib_null_id__'
+
+def _get_frequency_distribution(source_table, class_col):
+    """ Returns a dict containing the number of rows associated with each class
+    level. Each class level value is converted to a string using ::text.
+    """
+    query_result = plpy.execute("""
+        SELECT {class_col}::text AS classes,
+               count(*) AS class_count
+        FROM {source_table}
+        GROUP BY {class_col}
+        """.format(**locals()))
+    actual_level_counts = {}
+    for each_row in query_result:
+        level = each_row['classes']
+        if level:
+            level = level.strip()
+        actual_level_counts[level] = each_row['class_count']
+    return actual_level_counts
+
+
+def _validate_and_get_sampling_strategy(sampling_strategy_str, output_table_size,
+                                        supported_strategies=None, default=UNIFORM):
+    """ Returns the sampling strategy based on the class_sizes input param.
+    @param sampling_strategy_str The sampling strategy specified by the
+                                 user (class_sizes param)
+    @returns:
+        Str. One of [UNIFORM, UNDERSAMPLE, OVERSAMPLE]. Default is UNIFORM.
+    """
+    if not sampling_strategy_str:
+        sampling_strategy_str = default
+    else:
+        if len(sampling_strategy_str) < 3:
+            # Require at least 3 characters since UNIFORM and UNDERSAMPLE
+            # have a common prefix substring
+            plpy.error("Sample: Invalid class_sizes parameter")
+
+        if not supported_strategies:
+            supported_strategies = [UNIFORM, UNDERSAMPLE, OVERSAMPLE]
+        try:
+            # allow the user to specify a prefix substring of
+            # the supported strategies.
+            sampling_strategy_str = next(x for x in supported_strateg
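The `_get_frequency_distribution` helper above issues a single GROUP BY query through `plpy` and normalizes the result into a dict. A standalone sketch of the same logic, with `plpy` swapped for Python's built-in `sqlite3` so it can run outside the database; the `flags` table and `sex` column are made-up illustration data, not part of MADlib:

```python
import sqlite3

def get_frequency_distribution(conn, source_table, class_col):
    # Mirror of the reviewed helper: count rows per class level,
    # with the class value cast to text and whitespace stripped.
    rows = conn.execute(
        "SELECT CAST({c} AS TEXT) AS classes, count(*) AS class_count "
        "FROM {t} GROUP BY {c}".format(c=class_col, t=source_table)
    ).fetchall()
    counts = {}
    for level, n in rows:
        if level:
            level = level.strip()
        counts[level] = n
    return counts

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flags (sex TEXT)")
conn.executemany("INSERT INTO flags VALUES (?)",
                 [("male",), ("male",), ("female",)])
print(get_frequency_distribution(conn, "flags", "sex"))
```

As in the reviewed code, interpolating identifiers into the SQL string assumes the caller has already validated the table and column names.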

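The `_validate_and_get_sampling_strategy` function, truncated above, resolves a user-supplied prefix against the supported strategy names with `next()`. A minimal standalone sketch of that prefix matching; the function name `resolve_strategy` is my own, and `plpy.error` is replaced by `ValueError` since `plpy` only exists inside the database:

```python
UNIFORM, UNDERSAMPLE, OVERSAMPLE = 'uniform', 'undersample', 'oversample'

def resolve_strategy(user_str, supported=(UNIFORM, UNDERSAMPLE, OVERSAMPLE),
                     default=UNIFORM):
    # Empty/None input falls back to the default strategy.
    if not user_str:
        return default
    # 'uniform' and 'undersample' share the prefix 'un', so at least
    # 3 characters are needed to disambiguate.
    if len(user_str) < 3:
        raise ValueError("Sample: Invalid class_sizes parameter")
    try:
        # Accept any unambiguous prefix of a supported strategy.
        return next(s for s in supported if s.startswith(user_str.lower()))
    except StopIteration:
        raise ValueError("Sample: Invalid class_sizes parameter: %s" % user_str)

print(resolve_strategy('over'))   # 'oversample'
print(resolve_strategy('uni'))    # 'uniform'
```

Generalizing the matcher over a `supported` tuple is what makes the reviewer's suggestion (reusing one helper for several strategy sets) cheap to implement.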
[GitHub] madlib pull request #230: Balanced sets final

2018-02-02 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/230#discussion_r165527448
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---

[GitHub] madlib pull request #230: Balanced sets final

2018-02-02 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/230#discussion_r165479832
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---

[GitHub] madlib pull request #230: Balanced sets final

2018-02-02 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/230#discussion_r165505215
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in ---
@@ -0,0 +1,355 @@
+/* --- 
*//**
+ *
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ *
+ *
+ * @file balance_sample.sql_in
+ *
+ * @brief SQL functions for balanced data sets sampling.
+ * @date 12/14/2017
+ *
+ * @sa Given a table, balanced sampling returns a sampled data set
+ * with specified balancing for each class (defaults to uniform sampling).
+ *
+ *//* 
--- */
+
+m4_include(`SQLCommon.m4')
+
+
+/**
+@addtogroup grp_balance_sampling
+
+Contents
+
+Balanced Data Sets Sampling
+Examples
+
+@brief A method for independently sampling classes to produce a
+balanced data set.
+It is commonly used where classes have a high imbalance ratio.
+It ensures that the minority classes are adequately represented in the sample.
+
+@anchor strs
+@par Balanced Sampling
+
+balance_sample( source_table,
+                output_table,
+                class_col,
+                class_sizes,
+                output_table_size,
+                grouping_cols,
+                with_replacement,
+                keep_null
+              )
+
+\b Arguments
+
+source_table
+TEXT. Name of the table containing the input data.
+
+output_table
+TEXT. Name of the output table that contains the sampled data.
+The output table contains all columns present in the source
+table.
+
+class_col
+TEXT. Name of the column containing the class to be balanced.
+
+class_sizes (optional)
+VARCHAR, default 'uniform'.
+
+@note
+The current implementation only supports 'undersample'.
+
+Parameter to define the size of the different class values.
+(Class values are sometimes also called levels.)
+
+Can take the following forms:
+
+'uniform':
+All class values will be resampled to have the same number of rows.
+
+'undersample':
+Under-sample such that all class values end up with the same number of
+observations as the minority class.  Done without replacement by default
+unless the parameter 'with_replacement' is set to TRUE.
+
+'oversample':
+Over-sample with replacement such that all class values end up with the
+same number of observations as the majority class.  Not affected by the
+parameter 'with_replacement' since over-sampling is always done with
+replacement.
+
+You can also explicitly set class sizes in a string containing a
+comma-delimited list. Order does not matter and not all class values
+need to be specified.  Use the format "class_value_1=x, class_value_2=y, ..."
+where each class value in the list must exist in the column 'class_col'.
+
+E.g., 'male=3000, female=7000' means you want to resample the dataset
+to result in 3000 male and 7000 female rows in the 'output_table'.
+
+@note
+The allowed names for class values follow the object naming rules in
+PostgreSQL [6].  Quoted identifiers are allowed and should be enclosed
+in double quotes " in the usual way.  If for some reason the class values
+in the examples above were "MaLe" and "FeMaLe", then the comma-delimited
+list for 'class_sizes' would be:  '"MaLe"=3000, "FeMaLe"=7000'.
+
+output_table_size (optional)
+INTEGER, default NULL.  Desired size of the output data set.
+
+This parameter is ignored if the 'class_sizes' parameter is set to either
+'oversample' or 'undersample', since the output table size is already determined.
+If NULL, the resulting output table size will depend on the sett

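The `class_sizes` parameter documented above accepts either a strategy keyword or an explicit "class=count" list such as 'male=3000, female=7000'. A rough sketch of how both forms might be turned into per-class target row counts; the helper names are hypothetical, MADlib's real parser also handles quoted identifiers, and the 'uniform' branch here simply splits the total evenly (the actual module also honours `output_table_size`):

```python
def parse_class_sizes(spec):
    # Parse a comma-delimited "class=count" list, e.g.
    # 'male=3000, female=7000' -> {'male': 3000, 'female': 7000}.
    sizes = {}
    for item in spec.split(','):
        name, _, count = item.partition('=')
        sizes[name.strip()] = int(count)
    return sizes

def target_sizes(actual_counts, strategy):
    # Per-class target row counts for the three built-in strategies.
    if strategy == 'undersample':
        target = min(actual_counts.values())      # shrink to minority class
    elif strategy == 'oversample':
        target = max(actual_counts.values())      # grow to majority class
    else:  # 'uniform': keep total size, split evenly across classes
        target = sum(actual_counts.values()) // len(actual_counts)
    return {c: target for c in actual_counts}

counts = {'male': 9000, 'female': 1000}
print(target_sizes(counts, 'undersample'))   # {'male': 1000, 'female': 1000}
print(parse_class_sizes('male=3000, female=7000'))
```

The undersample/oversample targets follow directly from the parameter descriptions above: both strategies equalize every class to the minority or majority count respectively.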
[GitHub] madlib pull request #230: Balanced sets final

2018-02-02 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/230#discussion_r165530126
  
--- Diff: src/ports/postgres/modules/sample/balance_sample.py_in ---

  1   2   >