[GitHub] incubator-madlib pull request #89: K-means: support for array input

2017-01-23 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/89#discussion_r97463007
  
--- Diff: src/ports/postgres/modules/kmeans/kmeans.py_in ---
@@ -34,6 +37,25 @@ def kmeans_validate_src(schema_madlib, rel_source, **kwargs):
 
 # --
 
+def kmeans_validate_expr(schema_madlib, rel_source, expr_point, **kwargs):
+    if not columns_exist_in_table(rel_source, [expr_point]):
--- End diff --

I think it'd be much clearer to make this `if columns_exist...` and return 
immediately if true.
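
A minimal sketch of the early-return shape suggested here. `columns_exist` is a hypothetical stand-in for `columns_exist_in_table` that checks a plain set, and the array branch is reduced to a placeholder, so only the control flow is meaningful:

```python
def columns_exist(existing_columns, names):
    """Hypothetical stand-in for columns_exist_in_table (checks a set)."""
    return all(name in existing_columns for name in names)


def validate_expr(existing_columns, rel_source, expr_point):
    # Guard clause: the common case returns immediately ...
    if columns_exist(existing_columns, [expr_point]):
        return rel_source, False
    # ... so the special-case handling below needs no extra nesting.
    if expr_point.strip().lower().startswith("array"):
        return "km_view_placeholder", True
    raise ValueError(
        "kmeans error: {0} does not exist in {1}!".format(expr_point, rel_source))
```

With the positive test first, the "column exists" path is visible at the top of the function instead of being the implicit fall-through at the bottom.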


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-madlib pull request #89: K-means: support for array input

2017-01-23 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/89#discussion_r97463583
  
--- Diff: src/ports/postgres/modules/kmeans/kmeans.py_in ---
@@ -34,6 +37,25 @@ def kmeans_validate_src(schema_madlib, rel_source, **kwargs):
 
 # --
 
+def kmeans_validate_expr(schema_madlib, rel_source, expr_point, **kwargs):
+    if not columns_exist_in_table(rel_source, [expr_point]):
+
+        p = re.compile('[Aa][Rr][Rr][Aa][Yy]\s*\[\s*[\s*\w|,\s*]+\s*\]')
+
+        if p.match(expr_point.strip()):
+            view_name = unique_string('km_view')
+
+            plpy.execute(""" CREATE VIEW {view_name} AS
--- End diff --

Would this be better as a TEMP view? Or do all the callers create permanent 
tables?
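
A sketch of what the TEMP variant could look like, assuming the view is only needed for the current session. `build_view_sql` and its `temporary` flag are hypothetical names, and the statement is only built as a string here rather than executed:

```python
def build_view_sql(view_name, expr_point, rel_source, temporary=True):
    """Build the CREATE VIEW statement; PostgreSQL drops TEMP views at session end."""
    kind = "TEMP VIEW" if temporary else "VIEW"
    return ("CREATE {kind} {view_name} AS "
            "SELECT {expr_point} AS expr FROM {rel_source}").format(
                kind=kind, view_name=view_name,
                expr_point=expr_point, rel_source=rel_source)


sql = build_view_sql("km_view_1", "ARRAY[x, y]", "points")
```

A TEMP view would not leave a stray object behind if the caller fails before cleaning up; whether that fits depends on how the callers use the view afterwards.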




[GitHub] incubator-madlib pull request #89: K-means: support for array input

2017-01-23 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/89#discussion_r97463365
  
--- Diff: src/ports/postgres/modules/kmeans/kmeans.py_in ---
@@ -34,6 +37,25 @@ def kmeans_validate_src(schema_madlib, rel_source, **kwargs):
 
 # --
 
+def kmeans_validate_expr(schema_madlib, rel_source, expr_point, **kwargs):
+    if not columns_exist_in_table(rel_source, [expr_point]):
+
+        p = re.compile('[Aa][Rr][Rr][Aa][Yy]\s*\[\s*[\s*\w|,\s*]+\s*\]')
+
+        if p.match(expr_point.strip()):
+            view_name = unique_string('km_view')
+
+            plpy.execute(""" CREATE VIEW {view_name} AS
+                SELECT {expr_point} AS expr FROM {rel_source}
+                """.format(**locals()))
+            return view_name,True
+        else:
+            plpy.error(
+                """kmeans error: {expr_point} does not exist in {rel_source}!
+                """.format(**locals()))
+    return rel_source, False
+
--- End diff --

It would be kinda nice to pre-compile the re by sticking a 
`kmeans_validate_expr.p = re.compile(...)` down here, as per 
http://stackoverflow.com/questions/279561/what-is-the-python-equivalent-of-static-variables-inside-a-function.
 Others might well have different ideas about that, though.
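
A small sketch of the function-attribute idiom mentioned above. The function name `looks_like_array` and the simplified pattern are assumptions for illustration, not the pattern from the pull request:

```python
import re


def looks_like_array(expr_point):
    # The pattern lives on the function object, so it is compiled only once.
    return looks_like_array.pattern.match(expr_point.strip()) is not None


# The attribute has to be assigned after the def, i.e. "down here".
looks_like_array.pattern = re.compile(r'array\s*\[[\w\s,]+\]', re.IGNORECASE)
```

The `re` module already caches recently compiled patterns, so the gain is mostly in readability and in skipping the cache lookup; a module-level constant would work just as well.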




[GitHub] incubator-madlib pull request #89: K-means: support for array input

2017-01-23 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/89#discussion_r97462912
  
--- Diff: src/ports/postgres/modules/kmeans/kmeans.py_in ---
@@ -34,6 +37,25 @@ def kmeans_validate_src(schema_madlib, rel_source, **kwargs):
 
 # --
 
+def kmeans_validate_expr(schema_madlib, rel_source, expr_point, **kwargs):
+    if not columns_exist_in_table(rel_source, [expr_point]):
+
+        p = re.compile('[Aa][Rr][Rr][Aa][Yy]\s*\[\s*[\s*\w|,\s*]+\s*\]')
--- End diff --

A comment on what the regex is doing would be very helpful. The array part 
is pretty obvious (though, couldn't that just be done with a case-insensitive 
regex?), but the rest is more difficult to follow.
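
One way to get both the comment and the case-insensitivity is `re.VERBOSE` together with `re.IGNORECASE`. The sketch below is a reading of the pattern in the diff, not the author's stated intent. Inside the original character class the `*` and `|` are literal characters, so the class is written out explicitly here; the name `ARRAY_EXPR` is hypothetical:

```python
import re

ARRAY_EXPR = re.compile(r"""
    array        # the keyword, in any letter case thanks to re.IGNORECASE
    \s*\[        # optional whitespace, then the opening bracket
    [\s\w,*|]+   # one or more of: whitespace, word characters, ',', '*', '|'
    \]           # the closing bracket
    """, re.IGNORECASE | re.VERBOSE)
```

Because the character class already contains whitespace, the separate `\s*` on either side of it in the original is redundant and is dropped in this version.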




[GitHub] incubator-madlib pull request #89: K-means: support for array input

2017-01-23 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/89#discussion_r97462821
  
--- Diff: src/ports/postgres/modules/kmeans/kmeans.py_in ---
@@ -34,6 +37,25 @@ def kmeans_validate_src(schema_madlib, rel_source, **kwargs):
 
 # --
 
+def kmeans_validate_expr(schema_madlib, rel_source, expr_point, **kwargs):
--- End diff --

This function could really use some comments or a docstring about what it's 
supposed to be doing.




[GitHub] incubator-madlib issue #78: Graph: SSSP

2016-12-16 Thread decibel
Github user decibel commented on the issue:

https://github.com/apache/incubator-madlib/pull/78
  
Ok, I think I finally figured out what's going on here... you're keeping a 
table of every possible destination, as well as the minimum cost to that 
destination seen so far. The rest of this is just a question of walking through 
the edges, throwing away any paths that would exceed the cost to a destination 
that's already been seen, right?
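
The description above can be sketched in plain Python. This is not the SQL from the pull request; it assumes edges are given as `(src, dest, weight)` tuples and that the graph has no negative-weight cycle (otherwise the loop would not terminate):

```python
def shortest_costs(edges, source):
    """Return {vertex: (cost, parent)} for every vertex reachable from source."""
    best = {source: (0.0, None)}   # minimum cost seen so far per destination
    changed = True
    while changed:
        changed = False
        for src, dest, weight in edges:
            if src not in best:
                continue
            candidate = best[src][0] + weight
            # Throw away any path that does not beat the cost already seen.
            if dest not in best or candidate < best[dest][0]:
                best[dest] = (candidate, src)
                changed = True
    return best


edges = [(0, 1, 1.0), (0, 2, 5.0), (1, 2, 1.0), (2, 3, 2.0)]
result = shortest_costs(edges, 0)
```

Here vertex 2 is first reached directly at cost 5.0 and then replaced by the cheaper route through vertex 1.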




[GitHub] incubator-madlib pull request #78: Graph: SSSP

2016-12-16 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/78#discussion_r92910394
  
--- Diff: src/ports/postgres/modules/graph/sssp.py_in ---
@@ -0,0 +1,372 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Single Source Shortest Path
+
+# Please refer to the sssp.sql_in file for the documentation
+
+"""
+@file sssp.py_in
+
+@namespace graph
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import _assert
+from utilities.utilities import extract_keyvalue_params
+from utilities.utilities import unique_string
+from utilities.validate_args import get_cols
+from utilities.validate_args import unquote_ident
+from utilities.validate_args import table_exists
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import table_is_empty
+
+m4_changequote(`')
+
+def graph_sssp(schema_madlib, vertex_table, vertex_id, edge_table,
+   edge_args, source_vertex, out_table, **kwargs):
+   """
+Single source shortest path function for graphs using the Bellman-Ford
+algorithm [1].
+Args:
+@param vertex_table   Name of the table that contains the vertex data.
+@param vertex_id      Name of the column containing the vertex ids.
+@param edge_table     Name of the table that contains the edge data.
+@param edge_args      A comma-delimited string containing multiple
+                      named arguments of the form "name=value".
+@param source_vertex  The source vertex id for the algorithm to start.
+@param out_table      Name of the table to store the result of SSSP.
+
+[1] https://en.wikipedia.org/wiki/Bellman-Ford_algorithm
+"""
+
+   with MinWarning("warning"):
+
+   INT_MAX = 2147483647
+   EPSILON = 1.0E-06
+
+   message = unique_string(desp='message')
+
+   oldupdate = unique_string(desp='oldupdate')
+   newupdate = unique_string(desp='newupdate')
+
+   params_types = {'src': str, 'dest': str, 'weight': str}
+   default_args = {'src': 'src', 'dest': 'dest', 'weight': 'weight'}
+   edge_params = extract_keyvalue_params(edge_args,
+params_types,
+default_args)
+   if vertex_id is None:
+   vertex_id = "id"
+
+   src = edge_params["src"]
+   dest = edge_params["dest"]
+   weight = edge_params["weight"]
+
+   distribution = m4_ifdef(, ,
+   )
+   local_distribution = m4_ifdef(, ,
+   )
+
+   validate_graph_coding(vertex_table, vertex_id, edge_table,
+   edge_params, source_vertex, out_table)
+
+   plpy.execute(" DROP TABLE IF EXISTS {0},{1},{2}".format(
+   message,oldupdate,newupdate))
+
+   plpy.execute(
+   """ CREATE TABLE {out_table} AS
+   SELECT {vertex_id}::INT AS {vertex_id},
+   CAST('Infinity' AS DOUBLE PRECISION) AS {weight},
+   CAST({INT_MAX} AS INT) AS parent
--- End diff --

I think it'd be better to make that NULL instead of int_max. You can do 
that via `NULL::int AS parent`.
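
A runnable illustration of the NULL-parent idea. It uses an in-memory SQLite database as a stand-in, so the cast is spelled `CAST(NULL AS INTEGER)` rather than the PostgreSQL `NULL::int`; the table and column names are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vertex (id INTEGER)")
conn.executemany("INSERT INTO vertex VALUES (?)", [(0,), (1,), (2,)])

# NULL marks "no parent yet" instead of the sentinel value 2147483647.
conn.execute("""CREATE TABLE out_table AS
                SELECT id, CAST(NULL AS INTEGER) AS parent FROM vertex""")
conn.execute("UPDATE out_table SET parent = 0 WHERE id = 1")

# Vertices without a parent are found with IS NULL, no magic number needed.
unreached = [row[0] for row in conn.execute(
    "SELECT id FROM out_table WHERE parent IS NULL ORDER BY id")]
```

A NULL cannot collide with a real vertex id, which a sentinel such as INT_MAX in principle could.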




[GitHub] incubator-madlib pull request #78: Graph: SSSP

2016-12-16 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/78#discussion_r92910625
  
--- Diff: src/ports/postgres/modules/graph/sssp.py_in ---
@@ -0,0 +1,372 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Single Source Shortest Path
+
+# Please refer to the sssp.sql_in file for the documentation
+
+"""
+@file sssp.py_in
+
+@namespace graph
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import _assert
+from utilities.utilities import extract_keyvalue_params
+from utilities.utilities import unique_string
+from utilities.validate_args import get_cols
+from utilities.validate_args import unquote_ident
+from utilities.validate_args import table_exists
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import table_is_empty
+
+m4_changequote(`')
+
+def graph_sssp(schema_madlib, vertex_table, vertex_id, edge_table,
+   edge_args, source_vertex, out_table, **kwargs):
+   """
+Single source shortest path function for graphs using the Bellman-Ford
+algorithm [1].
+Args:
+@param vertex_table   Name of the table that contains the vertex data.
+@param vertex_id      Name of the column containing the vertex ids.
+@param edge_table     Name of the table that contains the edge data.
+@param edge_args      A comma-delimited string containing multiple
+                      named arguments of the form "name=value".
+@param source_vertex  The source vertex id for the algorithm to start.
+@param out_table      Name of the table to store the result of SSSP.
+
+[1] https://en.wikipedia.org/wiki/Bellman-Ford_algorithm
+"""
+
+   with MinWarning("warning"):
+
+   INT_MAX = 2147483647
+   EPSILON = 1.0E-06
+
+   message = unique_string(desp='message')
+
+   oldupdate = unique_string(desp='oldupdate')
+   newupdate = unique_string(desp='newupdate')
+
+   params_types = {'src': str, 'dest': str, 'weight': str}
+   default_args = {'src': 'src', 'dest': 'dest', 'weight': 'weight'}
+   edge_params = extract_keyvalue_params(edge_args,
+params_types,
+default_args)
+   if vertex_id is None:
+   vertex_id = "id"
+
+   src = edge_params["src"]
+   dest = edge_params["dest"]
+   weight = edge_params["weight"]
+
+   distribution = m4_ifdef(, ,
+   )
+   local_distribution = m4_ifdef(, ,
+   )
+
+   validate_graph_coding(vertex_table, vertex_id, edge_table,
+   edge_params, source_vertex, out_table)
+
+   plpy.execute(" DROP TABLE IF EXISTS {0},{1},{2}".format(
+   message,oldupdate,newupdate))
+
+   plpy.execute(
+   """ CREATE TABLE {out_table} AS
+   SELECT {vertex_id}::INT AS {vertex_id},
+   CAST('Infinity' AS DOUBLE PRECISION) AS {weight},
+   CAST({INT_MAX} AS INT) AS parent
+   FROM {vertex_table} {distribution} """.format(**locals()))
+   plpy.execute(
+   """ CREATE TEMP TABLE {oldupdate}(
+   id INT, val DOUBLE PRECISION, parent INT)
+   {local_distribution}
+ 

[GitHub] incubator-madlib pull request #78: Graph: SSSP

2016-12-16 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/78#discussion_r92910595
  

[GitHub] incubator-madlib pull request #78: Graph: SSSP

2016-12-16 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/78#discussion_r92910466
  

[GitHub] incubator-madlib pull request #78: Graph: SSSP

2016-12-16 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/78#discussion_r92910546
  

[GitHub] incubator-madlib issue #78: Graph: SSSP

2016-12-16 Thread decibel
Github user decibel commented on the issue:

https://github.com/apache/incubator-madlib/pull/78
  
You should really add that as a comment to the code. It's much clearer than 
reading the code itself.

And yes, it definitely helped me. I now realize what's been bugging me 
about this... the original algorithm on Wikipedia is extremely general purpose; 
so much so that I missed an important fact: **a vertex should never need to be 
updated twice**, so long as you consider **all** edges of the vertex 
simultaneously. Of course, doing that in regular code would be a pain, but in 
SQL it's trivial.

Now that I understand what's going on, I see that the real trick here is to 
keep track of what the last set of updates was. I think it'd probably be a lot 
simpler to just keep a field recording which nesting level we're on, which is how 
this would end up coded in a CTE. I've tried a couple of times to code that, but 
still haven't wrapped my head around it because of how far away from set theory 
the pseudocode is.
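
A rough Python sketch of the "last set of updates plus nesting level" idea, written round by round the way a set-based query would process it. The function name, the tuple layout, and the example graph are assumptions; it also assumes there is no negative-weight cycle:

```python
def sssp_by_level(edges, source):
    """Return {vertex: (cost, parent, level)}; level is the round that set the row."""
    best = {source: (0.0, None, 0)}
    frontier = {source}            # vertices updated in the previous round
    level = 0
    while frontier:
        level += 1
        # Consider all edges leaving the last set of updates at once and
        # keep only the cheapest candidate per destination.
        candidates = {}
        for src, dest, weight in edges:
            if src in frontier:
                cost = best[src][0] + weight
                if dest not in candidates or cost < candidates[dest][0]:
                    candidates[dest] = (cost, src)
        frontier = set()
        for dest, (cost, parent) in candidates.items():
            if dest not in best or cost < best[dest][0]:
                best[dest] = (cost, parent, level)
                frontier.add(dest)
    return best


edges = [(0, 1, 1.0), (0, 2, 2.0), (1, 3, 1.0), (2, 3, 5.0)]
levels = sssp_by_level(edges, 0)
```

In the example, vertex 3 has two incoming candidates in round 2 (cost 2.0 via vertex 1 and 7.0 via vertex 2); grouping by destination keeps only the cheaper one, which is the step that a `GROUP BY` with `MIN` would do in SQL.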




[GitHub] incubator-madlib issue #78: Graph: SSSP

2016-12-15 Thread decibel
Github user decibel commented on the issue:

https://github.com/apache/incubator-madlib/pull/78
  
Sounds good. I'd still like to see pseudocode as part of the 
documentation, because I don't think all these steps are necessary.





[GitHub] incubator-madlib issue #78: Graph: SSSP

2016-12-12 Thread decibel
Github user decibel commented on the issue:

https://github.com/apache/incubator-madlib/pull/78
  
Yeah, the 1GB limit is certainly a consideration. Using int4s for 
everything, a composite of (src, dest, weight) would be 13-16 bytes (3 * int4 = 
12, plus the varlena header and maybe alignment). With the worst case of 16 bytes, 
that would be about 67 million edges (2^30 / 16), so that's the maximum that 
could be updated at once.
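
The estimate can be checked with a few lines of arithmetic. The sketch assumes the 1GB limit means 2^30 bytes and takes the 16-byte worst case from the paragraph above; the variable names are illustrative:

```python
LIMIT_BYTES = 2 ** 30        # assumption: "1GB" is 1 GiB
INT4_BYTES = 4

payload = 3 * INT4_BYTES     # src, dest and weight stored as int4
worst_case_row = 16          # payload plus header and alignment, as estimated above

max_rows = LIMIT_BYTES // worst_case_row
```

Under these assumptions the division gives 67,108,864 rows, i.e. roughly 67 million edges per update.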

Before coding a full implementation, could you produce pseudocode of what the 
ideal minimum set-based implementation would be? In particular, I'm thinking 
that {messages} is completely redundant, and if that's the case I suspect this 
entire algorithm could be done in a recursive CTE.
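
To make the recursive-CTE suggestion concrete, here is a deliberately naive sketch run against an in-memory SQLite database (a stand-in for PostgreSQL; table and column names are assumptions). It enumerates every walk of up to |V| - 1 edges and takes the minimum per vertex, so it shows the shape of the query rather than an efficient implementation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src INTEGER, dest INTEGER, weight REAL)")
conn.executemany("INSERT INTO edge VALUES (?, ?, ?)",
                 [(0, 1, 1.0), (0, 2, 5.0), (1, 2, 1.0), (2, 3, 2.0)])

# Walk outward from source vertex 0, one edge per recursion level, and stop
# after |V| - 1 = 3 levels (the longest simple path in this 4-vertex graph).
rows = conn.execute("""
    WITH RECURSIVE walk(vertex, cost, level) AS (
        SELECT 0, 0.0, 0
        UNION ALL
        SELECT e.dest, w.cost + e.weight, w.level + 1
        FROM walk AS w JOIN edge AS e ON e.src = w.vertex
        WHERE w.level < 3
    )
    SELECT vertex, MIN(cost) FROM walk GROUP BY vertex ORDER BY vertex
""").fetchall()
```

Since nothing is pruned inside the recursion, the number of intermediate rows can grow quickly on dense graphs; a practical version would need some way to discard walks that are already more expensive than a known cost.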




[GitHub] incubator-madlib pull request #78: Graph: SSSP

2016-12-12 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/78#discussion_r92069133
  
--- Diff: src/ports/postgres/modules/graph/sssp.py_in ---
@@ -0,0 +1,347 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Single Source Shortest Path
+
+# Please refer to the sssp.sql_in file for the documentation
+
+"""
+@file sssp.py_in
+
+@namespace graph
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import _assert
+from utilities.utilities import extract_keyvalue_params
+from utilities.utilities import unique_string
+from utilities.validate_args import get_cols
+from utilities.validate_args import unquote_ident
+from utilities.validate_args import table_exists
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import table_is_empty
+
+m4_changequote(`')
+
+def graph_sssp(schema_madlib, vertex_table, vertex_id, edge_table,
+   edge_args, source_vertex, out_table, **kwargs):
+   """
+Single source shortest path function for graphs
+Args:
+@param vertex_table   Name of the table that contains the vertex data.
+@param vertex_id      Name of the column containing the vertex ids.
+@param edge_table     Name of the table that contains the edge data.
+@param edge_args      A comma-delimited string containing multiple
+                      named arguments of the form "name=value".
+@param source_vertex  The source vertex id for the algorithm to start.
+@param out_table      Name of the table to store the result of SSSP.
+"""
+
+   with MinWarning("warning"):
+
+   INT_MAX = 2147483647
+
+   message = unique_string(desp='message')
+   toupdate = unique_string(desp='toupdate')
+
+   params_types = {'src': str, 'dest': str, 'weight': str}
+   default_args = {'src': 'src', 'dest': 'dest', 'weight': 'weight'}
+   edge_params = extract_keyvalue_params(edge_args,
+params_types,
+default_args)
+   if vertex_id is None:
+   vertex_id = "id"
+
+   src = edge_params["src"]
+   dest = edge_params["dest"]
+   weight = edge_params["weight"]
+
+   distribution = m4_ifdef(, ,
+   )
+   local_distribution = m4_ifdef(, ,
+   )
+
+   validate_graph_coding(vertex_table, vertex_id, edge_table,
+   edge_params, source_vertex, out_table)
+
+   plpy.execute(" DROP TABLE IF EXISTS {0},{1}".format(message,toupdate))
--- End diff --

Well, the issue exists throughout the code. Maybe it's just not worth 
worrying about... this code can only be called from inside the database anyway, 
right? If you're already in the database, you don't need SQL injection to break 
things, UNLESS the function is SECURITY DEFINER (which presumably this is not).




[GitHub] incubator-madlib pull request #78: Graph: SSSP

2016-12-12 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/78#discussion_r92044676
  
--- Diff: src/ports/postgres/modules/graph/sssp.py_in ---
@@ -0,0 +1,347 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Single Source Shortest Path
+
+# Please refer to the sssp.sql_in file for the documentation
+
+"""
+@file sssp.py_in
+
+@namespace graph
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import _assert
+from utilities.utilities import extract_keyvalue_params
+from utilities.utilities import unique_string
+from utilities.validate_args import get_cols
+from utilities.validate_args import unquote_ident
+from utilities.validate_args import table_exists
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import table_is_empty
+
+m4_changequote(`')
+
+def graph_sssp(schema_madlib, vertex_table, vertex_id, edge_table,
+   edge_args, source_vertex, out_table, **kwargs):
+   """
+Single source shortest path function for graphs
+Args:
+@param vertex_table  Name of the table that contains the vertex data.
+@param vertex_id     Name of the column containing the vertex ids.
+@param edge_table    Name of the table that contains the edge data.
+@param edge_args     A comma-delimited string containing multiple
+                     named arguments of the form "name=value".
+@param source_vertex The source vertex id for the algorithm to start.
+@param out_table     Name of the table to store the result of SSSP.
+"""
+
--- End diff --

There should be some kind of documentation about the algorithm used; at 
least a reference to the wiki article.




[GitHub] incubator-madlib pull request #78: Graph: SSSP

2016-12-12 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/78#discussion_r92041602
  
--- Diff: src/ports/postgres/modules/graph/sssp.py_in ---
@@ -0,0 +1,347 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Single Source Shortest Path
+
+# Please refer to the sssp.sql_in file for the documentation
+
+"""
+@file sssp.py_in
+
+@namespace graph
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import _assert
+from utilities.utilities import extract_keyvalue_params
+from utilities.utilities import unique_string
+from utilities.validate_args import get_cols
+from utilities.validate_args import unquote_ident
+from utilities.validate_args import table_exists
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import table_is_empty
+
+m4_changequote(`')
+
+def graph_sssp(schema_madlib, vertex_table, vertex_id, edge_table,
+   edge_args, source_vertex, out_table, **kwargs):
+   """
+Single source shortest path function for graphs
+Args:
+@param vertex_table  Name of the table that contains the vertex data.
+@param vertex_id     Name of the column containing the vertex ids.
+@param edge_table    Name of the table that contains the edge data.
+@param edge_args     A comma-delimited string containing multiple
+                     named arguments of the form "name=value".
+@param source_vertex The source vertex id for the algorithm to start.
+@param out_table     Name of the table to store the result of SSSP.
+"""
+
+   with MinWarning("warning"):
+
+   INT_MAX = 2147483647
+
+   message = unique_string(desp='message')
+   toupdate = unique_string(desp='toupdate')
+
+   params_types = {'src': str, 'dest': str, 'weight': str}
+   default_args = {'src': 'src', 'dest': 'dest', 'weight': 'weight'}
+   edge_params = extract_keyvalue_params(edge_args,
+params_types,
+default_args)
+   if vertex_id is None:
+   vertex_id = "id"
+
+   src = edge_params["src"]
+   dest = edge_params["dest"]
+   weight = edge_params["weight"]
+
+   distribution = m4_ifdef(, ,
+   )
+   local_distribution = m4_ifdef(, ,
+   )
+
+   validate_graph_coding(vertex_table, vertex_id, edge_table,
+   edge_params, source_vertex, out_table)
+
+   plpy.execute(" DROP TABLE IF EXISTS {0},{1}".format(message,toupdate))
+
+   plpy.execute(
+   """ CREATE TABLE {out_table} AS
+   SELECT {vertex_id}::INT AS {vertex_id},
+   CAST({INT_MAX} AS INT) AS {weight},
+   CAST({INT_MAX} AS INT) AS parent
+   FROM {vertex_table} {distribution} """.format(**locals()))
+
+   plpy.execute(
+   """ CREATE TEMP TABLE {message}(
+   id INT, val INT, parent INT)
+   {local_distribution} """.format(**locals()))
+   plpy.execute(
+   """ CREATE TEMP TABLE {toupdate}(
+   id INT, val INT, parent INT)
+   {local_distribution} """.format(**locals()))
  

[GitHub] incubator-madlib pull request #78: Graph: SSSP

2016-12-12 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/78#discussion_r92051375
  
--- Diff: src/ports/postgres/modules/graph/sssp.py_in ---
@@ -0,0 +1,347 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Single Source Shortest Path
+
+# Please refer to the sssp.sql_in file for the documentation
+
+"""
+@file sssp.py_in
+
+@namespace graph
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import _assert
+from utilities.utilities import extract_keyvalue_params
+from utilities.utilities import unique_string
+from utilities.validate_args import get_cols
+from utilities.validate_args import unquote_ident
+from utilities.validate_args import table_exists
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import table_is_empty
+
+m4_changequote(`')
+
+def graph_sssp(schema_madlib, vertex_table, vertex_id, edge_table,
+   edge_args, source_vertex, out_table, **kwargs):
+   """
+Single source shortest path function for graphs
+Args:
+@param vertex_table  Name of the table that contains the vertex data.
+@param vertex_id     Name of the column containing the vertex ids.
+@param edge_table    Name of the table that contains the edge data.
+@param edge_args     A comma-delimited string containing multiple
+                     named arguments of the form "name=value".
+@param source_vertex The source vertex id for the algorithm to start.
+@param out_table     Name of the table to store the result of SSSP.
+"""
+
+   with MinWarning("warning"):
+
+   INT_MAX = 2147483647
+
+   message = unique_string(desp='message')
+   toupdate = unique_string(desp='toupdate')
+
+   params_types = {'src': str, 'dest': str, 'weight': str}
+   default_args = {'src': 'src', 'dest': 'dest', 'weight': 'weight'}
+   edge_params = extract_keyvalue_params(edge_args,
+params_types,
+default_args)
+   if vertex_id is None:
+   vertex_id = "id"
+
+   src = edge_params["src"]
+   dest = edge_params["dest"]
+   weight = edge_params["weight"]
+
+   distribution = m4_ifdef(, ,
+   )
+   local_distribution = m4_ifdef(, ,
+   )
+
+   validate_graph_coding(vertex_table, vertex_id, edge_table,
+   edge_params, source_vertex, out_table)
+
+   plpy.execute(" DROP TABLE IF EXISTS {0},{1}".format(message,toupdate))
+
+   plpy.execute(
+   """ CREATE TABLE {out_table} AS
+   SELECT {vertex_id}::INT AS {vertex_id},
+   CAST({INT_MAX} AS INT) AS {weight},
+   CAST({INT_MAX} AS INT) AS parent
+   FROM {vertex_table} {distribution} """.format(**locals()))
+
+   plpy.execute(
+   """ CREATE TEMP TABLE {message}(
--- End diff --

Ok, after reading [2], I see why this is called messages.




[GitHub] incubator-madlib pull request #78: Graph: SSSP

2016-12-12 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/78#discussion_r92054290
  
--- Diff: src/ports/postgres/modules/graph/sssp.py_in ---
@@ -0,0 +1,347 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Single Source Shortest Path
+
+# Please refer to the sssp.sql_in file for the documentation
+
+"""
+@file sssp.py_in
+
+@namespace graph
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import _assert
+from utilities.utilities import extract_keyvalue_params
+from utilities.utilities import unique_string
+from utilities.validate_args import get_cols
+from utilities.validate_args import unquote_ident
+from utilities.validate_args import table_exists
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import table_is_empty
+
+m4_changequote(`')
+
+def graph_sssp(schema_madlib, vertex_table, vertex_id, edge_table,
+   edge_args, source_vertex, out_table, **kwargs):
+   """
+Single source shortest path function for graphs
+Args:
+@param vertex_table  Name of the table that contains the vertex data.
+@param vertex_id     Name of the column containing the vertex ids.
+@param edge_table    Name of the table that contains the edge data.
+@param edge_args     A comma-delimited string containing multiple
+                     named arguments of the form "name=value".
+@param source_vertex The source vertex id for the algorithm to start.
+@param out_table     Name of the table to store the result of SSSP.
+"""
+
+   with MinWarning("warning"):
+
+   INT_MAX = 2147483647
+
+   message = unique_string(desp='message')
+   toupdate = unique_string(desp='toupdate')
+
+   params_types = {'src': str, 'dest': str, 'weight': str}
+   default_args = {'src': 'src', 'dest': 'dest', 'weight': 'weight'}
+   edge_params = extract_keyvalue_params(edge_args,
+params_types,
+default_args)
+   if vertex_id is None:
+   vertex_id = "id"
+
+   src = edge_params["src"]
+   dest = edge_params["dest"]
+   weight = edge_params["weight"]
+
+   distribution = m4_ifdef(, ,
+   )
+   local_distribution = m4_ifdef(, ,
+   )
+
+   validate_graph_coding(vertex_table, vertex_id, edge_table,
+   edge_params, source_vertex, out_table)
+
+   plpy.execute(" DROP TABLE IF EXISTS {0},{1}".format(message,toupdate))
+
+   plpy.execute(
+   """ CREATE TABLE {out_table} AS
+   SELECT {vertex_id}::INT AS {vertex_id},
+   CAST({INT_MAX} AS INT) AS {weight},
+   CAST({INT_MAX} AS INT) AS parent
+   FROM {vertex_table} {distribution} """.format(**locals()))
+
+   plpy.execute(
+   """ CREATE TEMP TABLE {message}(
+   id INT, val INT, parent INT)
+   {local_distribution} """.format(**locals()))
+   plpy.execute(
+   """ CREATE TEMP TABLE {toupdate}(
+   id INT, val INT, parent INT)
+   {local_distribution} """.format(**locals()))
  

[GitHub] incubator-madlib pull request #78: Graph: SSSP

2016-12-12 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/78#discussion_r92040714
  
--- Diff: src/ports/postgres/modules/graph/sssp.py_in ---
@@ -0,0 +1,347 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Single Source Shortest Path
+
+# Please refer to the sssp.sql_in file for the documentation
+
+"""
+@file sssp.py_in
+
+@namespace graph
+"""
+
+import plpy
+from utilities.control import MinWarning
+from utilities.utilities import _assert
+from utilities.utilities import extract_keyvalue_params
+from utilities.utilities import unique_string
+from utilities.validate_args import get_cols
+from utilities.validate_args import unquote_ident
+from utilities.validate_args import table_exists
+from utilities.validate_args import columns_exist_in_table
+from utilities.validate_args import table_is_empty
+
+m4_changequote(`')
+
+def graph_sssp(schema_madlib, vertex_table, vertex_id, edge_table,
+   edge_args, source_vertex, out_table, **kwargs):
+   """
+Single source shortest path function for graphs
+Args:
+@param vertex_table  Name of the table that contains the vertex data.
+@param vertex_id     Name of the column containing the vertex ids.
+@param edge_table    Name of the table that contains the edge data.
+@param edge_args     A comma-delimited string containing multiple
+                     named arguments of the form "name=value".
+@param source_vertex The source vertex id for the algorithm to start.
+@param out_table     Name of the table to store the result of SSSP.
+"""
+
+   with MinWarning("warning"):
+
+   INT_MAX = 2147483647
+
+   message = unique_string(desp='message')
+   toupdate = unique_string(desp='toupdate')
+
+   params_types = {'src': str, 'dest': str, 'weight': str}
+   default_args = {'src': 'src', 'dest': 'dest', 'weight': 'weight'}
+   edge_params = extract_keyvalue_params(edge_args,
+params_types,
+default_args)
+   if vertex_id is None:
+   vertex_id = "id"
+
+   src = edge_params["src"]
+   dest = edge_params["dest"]
+   weight = edge_params["weight"]
+
+   distribution = m4_ifdef(, ,
+   )
+   local_distribution = m4_ifdef(, ,
+   )
+
+   validate_graph_coding(vertex_table, vertex_id, edge_table,
+   edge_params, source_vertex, out_table)
+
+   plpy.execute(" DROP TABLE IF EXISTS {0},{1}".format(message,toupdate))
--- End diff --

There's a SQL-injection risk here. Additionally, this assumes that 
`message` and `toupdate` are correctly quoted. But maybe that's OK in this 
context...
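
To make the quoting concern concrete, here is a minimal sketch (not MADlib code). The helpers `quote_ident` and `drop_tables_sql` are hypothetical names written for this example; they follow PostgreSQL's identifier-quoting rule of wrapping the name in double quotes and doubling any embedded double quote. Inside PL/Python, `plpy.quote_ident` should serve the same purpose on PostgreSQL versions that provide it.

```python
def quote_ident(name):
    """Quote a SQL identifier: wrap it in double quotes and double any
    embedded double quote (hypothetical helper for illustration)."""
    return '"' + name.replace('"', '""') + '"'


def drop_tables_sql(*table_names):
    """Build a DROP TABLE statement in which every name is quoted."""
    if not table_names:
        raise ValueError("at least one table name is required")
    quoted = ", ".join(quote_ident(n) for n in table_names)
    return "DROP TABLE IF EXISTS " + quoted


# A hostile name stays inside a single quoted identifier.
print(drop_tables_sql("message_123", 'x"; DROP TABLE users; --'))
```

One caveat with this approach: quoted identifiers are case-sensitive, so it assumes the names are already in the case the catalog stores.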




[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

2016-06-21 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/49#discussion_r67954746
  
--- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
@@ -35,41 +36,83 @@ def sessionize(schema_madlib, source_table, output_table, partition_expr,
 @param source_table: str, Name of the input table/view
 @param output_table: str, Name of the table to store result
 @param partition_expr: str, Expression to partition (group) the input data
-@param time_stamp: str, Column name with time used for sessionization calculation
+@param time_stamp: str, The time stamp column name that is used for sessionization calculation
 @param max_time: interval, Delta time between subsequent events to define a session
-
+@param output_cols: str, list of columns the output table/view must contain (default '*'):
+* - all columns in the input table, and a new session ID column
+'a,b,c,...' -  a comma separated list of column names/expressions to be projected, along with a new session ID column
--- End diff --

> We will have to decide if this is too hard a constraint to have or not.
> Having this constraint will take away all the messy string parsing stuff
> we currently have implemented though.

Yeah, I didn't realize what the code was doing. I think it would be nice 
to allow the user to choose different output names if they want.

Perhaps a good compromise would be to detect ' AS ' in the string and 
then treat it as a raw select clause. Another option would be to treat 
an array as a list of columns and anything else as a select clause 
(which would also support the * case, I think).
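
A rough sketch of the two options, to show how small the dispatch could be. This is not the sessionize implementation; `has_alias` and `select_clause` are hypothetical names, and the ' AS ' check is deliberately naive (it would also match ' AS ' inside a string literal).

```python
import re

# Matches the keyword AS surrounded by whitespace, in any letter case.
_AS_PATTERN = re.compile(r"\sAS\s", re.IGNORECASE)


def has_alias(expr):
    """Option 1: report whether the string contains an ' AS ' keyword,
    in which case the caller would treat it as a raw select clause."""
    return _AS_PATTERN.search(expr) is not None


def select_clause(output_cols):
    """Option 2: a list is a list of column names; any string (including
    '*') is passed through as a raw select clause."""
    if output_cols is None:
        return "*"
    if isinstance(output_cols, (list, tuple)):
        return ", ".join(col.strip() for col in output_cols)
    return output_cols.strip()


print(has_alias("user_id, amount * 2 AS doubled"))
print(select_clause(["user_id", " event_ts "]))
print(select_clause("*"))
```

With option 2 the function never has to parse expressions itself, which is the main attraction.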





[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

2016-06-21 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/49#discussion_r67954244
  
--- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
@@ -35,41 +36,83 @@ def sessionize(schema_madlib, source_table, output_table, partition_expr,
 @param source_table: str, Name of the input table/view
 @param output_table: str, Name of the table to store result
 @param partition_expr: str, Expression to partition (group) the input data
-@param time_stamp: str, Column name with time used for sessionization calculation
+@param time_stamp: str, The time stamp column name that is used for sessionization calculation
 @param max_time: interval, Delta time between subsequent events to define a session
-
+@param output_cols: str, list of columns the output table/view must contain (default '*'):
+* - all columns in the input table, and a new session ID column
+'a,b,c,...' -  a comma separated list of column names/expressions to be projected, along with a new session ID column
+@param create_view: boolean, indicates if the output is a view or a table with name specified by output_table (default TRUE)
+TRUE - create view
+FALSE - materialize results into a table
 """
 with MinWarning("error"):
 _validate(source_table, output_table, partition_expr, time_stamp, 
max_time)
+table_or_view = 'VIEW' if create_view or create_view is None else 
'TABLE'
+output_cols_to_project = '*' if output_cols is None else 
output_cols
 
-all_input_cols_str = ', '.join([i.strip() for i in 
get_cols(source_table, schema_madlib)])
+cols_to_project = get_column_names(schema_madlib, source_table, 
output_cols_to_project)
 session_id = 'session_id' if not is_var_valid(source_table, 
'session_id') else unique_string('session_id')
 
 # Create temp column names for intermediate columns.
 new_partition = unique_string('new_partition')
 new_session = unique_string('new_session')
 
 plpy.execute("""
-CREATE TABLE {output_table} AS
+CREATE {table_or_view} {output_table} AS
 SELECT
-{all_input_cols_str},
+{cols_to_project},
 CASE WHEN {time_stamp} IS NOT NULL
- THEN SUM(CASE WHEN {new_partition} OR 
{new_session} THEN 1 END)
-  OVER (PARTITION BY {partition_expr}
-  ORDER BY {time_stamp})
-END AS {session_id}
+THEN SUM(CASE WHEN {new_partition} OR 
{new_session} THEN 1 END) OVER (PARTITION BY {partition_expr} ORDER BY 
{time_stamp}) END AS {session_id}
 FROM (
 SELECT *,
-ROW_NUMBER() OVER (w) = 1
-AND {time_stamp} IS NOT NULL AS 
{new_partition},
-({time_stamp} - LAG({time_stamp}, 1)
-OVER (w)) > '{max_time}'::INTERVAL AS 
{new_session}
-FROM {source_table}
-WINDOW w AS (PARTITION BY {partition_expr}
- ORDER BY {time_stamp})
+ROW_NUMBER() OVER (w) = 1 AND {time_stamp} IS 
NOT NULL AS {new_partition},
+({time_stamp}-LAG({time_stamp}, 1) OVER (w)) > 
'{max_time}'::INTERVAL AS {new_session}
+FROM {source_table} WINDOW w AS (PARTITION BY 
{partition_expr} ORDER BY {time_stamp})
 ) a
 """.format(**locals()))
 
+def get_column_names(schema_madlib, source_table, output_cols):
+"""
+This method creates a string that can be used in the SQL statement 
to project the columns specified in the output_cols parameter.
+
+Return:
+a string to be used in the SQL statement
+"""
+table_columns_list = get_cols(source_table, schema_madlib)
+if output_cols.strip() == '*':
+output_cols_str = get_cols_str(table_columns_list)
+else:
+output_cols_list, output_cols_names = 
get_columns_from_expression(output_cols, table_columns_list)
+_validate_output_cols(source_table, output_cols_list)
+output_cols_str = ', '.join([output_cols_names[i] if 
output_cols_list[i] == '*' else output_cols_list[i] + ' AS ' + 
out

[GitHub] incubator-madlib pull request #47: Feature: Pivot Function

2016-06-21 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/47#discussion_r67944965
  
--- Diff: src/ports/postgres/modules/utilities/pivot.sql_in ---
@@ -0,0 +1,202 @@
+/* --- 
*//**
+ *
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ *
+ *
+ * @file pivot.sql_in
+ *
+ * @brief SQL functions for pivoting
+ * @date June 2014
+ *
+ * @sa Creates a pivot table for data summarization.
+ *
+ *//* 
--- */
+
+m4_include(`SQLCommon.m4')
+
+/**
+@addtogroup grp_pivot
+
+Contents
+
+Pivoting
+Notes
+Examples
+
+
+
+@brief Provides pivoting functions helpful for data preparation before 
modeling
+
+@anchor categorical
+The goal of the MADlib pivot function is to provide a data summarization tool
+that can do basic OLAP type operations on data stored in one table and output
+the summarized data to a second table.
+
+
+
+pivot(
+   source_table,
+out_table,
+index,
+pivot_cols,
+pivot_values
+)
+
+\b Arguments
+
+source_table
+VARCHAR. Name of the source table, containing data for pivoting.
+output_table
+VARCHAR. Name of output table that contains pivoted data.
+The output table ('output_table' above) has all the columns present in
+index column list, plus additional columns for each distinct value in
+pivot_cols. The column name for the pivot is set as
+'pivot name'_'pivot value'.
+
+index
+VARCHAR. Comma-separated columns that will form the index of the output
+pivot table.
+pivot_cols
+VARCHAR. Comma-separated columns that will form the columns of the
+output pivot table.
+pivot_values
+VARCHAR. Comma-separated columns that contain the values to be
+summarized in the output pivot table.
+
+
+
+@anchor notes
+@par Notes
+
+The default aggregate function is "sum". 
+
+NULL values in the index column are treated as any other value. 
+
+NULL values in the pivot column are ignored.
+
+NULL values in the value column are handled by the aggregate function.
+
+The following features are planned but not yet implemented.
+
+- Multiple index columns.
+- Multiple pivot columns.
+- Multiple value columns.
+- Aggregate functions as input.
+- NULL values in the pivot.
+
+
+@anchor examples
+@examp
+
+-#  Create a toy dataset.
+
+CREATE TABLE pivset(
+  id INTEGER,
+  piv FLOAT8,
+  val FLOAT8
+);
+INSERT INTO pivset VALUES
+   (0, 10, 1),
+   (0, 10, 2),
+   (0, 20, 3),
+   (1, 20, 4),
+   (1, 30, 5),
+   (1, 30, 6),
+   (1, 10, 7),
+   (NULL, 10, 8),
+   (1, NULL, 9),
+   (1, 10, NULL);
+
+
+-# Pivot the table
+
+DROP TABLE IF EXISTS pivout;
+SELECT madlib.pivot('pivset', 'pivout', 'id', 'piv', 'val');
+SELECT * FROM pivout;
+
+
+ id | piv_10.0 | piv_20.0 | piv_30.0
+----+----------+----------+----------
+  0 |        3 |        3 |
+  1 |        7 |        4 |       11
+    |        8 |        0 |        0
+
+*/
+
+-
+
+
+/**
+ * @brief Helper function that can be used to pivot tables
+ *
+ * @param source_table The original data table
+ * @param out_tableThe output table that contains the dummy
+ * variable columns
+ * @param index        The index columns to group the records by
+ * @param pivot_cols   The columns to pivot the table
+ * @pa

[GitHub] incubator-madlib pull request #47: Feature: Pivot Function

2016-06-21 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/47#discussion_r67943850
  
--- Diff: src/ports/postgres/modules/utilities/pivot.py_in ---
@@ -0,0 +1,201 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Pivoting
+# The goal of the MADlib pivot function is to provide a data summarization tool
+# that can do basic OLAP type operations on data stored in one table and output
+# the summarized data to a second table.  Typical operations are count, average,
+# min, max and standard deviation, however user defined aggregates (UDAs) are
+# also allowed.
+
+# Please refer to the pivot.sql_in file for the documentation
+
+"""
+@file pivot.py_in
+
+"""
+import plpy
+from utilities import _assert
+from utilities import split_quoted_delimited_str
+from utilities import strip_end_quotes
+from validate_args import table_exists
+from validate_args import columns_exist_in_table
+from validate_args import table_is_empty
+from validate_args import _get_table_schema_names
+from validate_args import get_first_schema
+
+m4_changequote(`')
+
+def pivot(schema_madlib, source_table, out_table,
+ index, pivot_cols, pivot_values,
+ aggregate_func, **kwargs):
+"""
+Helper function that can be used to pivot tables
+Args:
+@param source_table The original data table
+@param out_tableThe output table that contains the dummy
+variable columns
+@param index            The index columns to group the records by
+@param pivot_cols   The columns to pivot the table
+@param pivot_values The value columns to be summarized in the
+pivoted table
+@param aggregate_func   The aggregate function to be applied to the
+values
+
+"""
+indices = split_quoted_delimited_str(index)
+pcols = split_quoted_delimited_str(pivot_cols)
+pvals = split_quoted_delimited_str(pivot_values)
+# aggregate_func = "sum"
+validate_pivot_coding(source_table, out_table, indices, pcols, pvals)
+new_col_names =[]
+sql_list = ["CREATE TABLE " + out_table + " AS (SELECT " + index]
+# Preparation for multiple index, pivot, etc.
--- End diff --

It would be useful to have a comment/docstring explaining the form of the 
query that we're trying to build. Comments on some of the individual constructs 
would probably be good too.
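
For instance, a docstring could show the target query shape along these lines. This is only a sketch under assumptions: the one-aggregate-per-pivot-value CASE form is my guess at what the generated SQL looks like, `pivot_query` is a hypothetical name, and the pivot values are assumed to be numeric literals that need no quoting.

```python
def pivot_query(source_table, out_table, index, pivot_col, value_col,
                pivot_values, aggregate="sum"):
    """Build a pivot query of the assumed form:

        CREATE TABLE out AS (
            SELECT index,
                   agg(CASE WHEN piv = v1 THEN val END) AS "piv_v1",
                   agg(CASE WHEN piv = v2 THEN val END) AS "piv_v2"
            FROM source GROUP BY index)
    """
    cases = []
    for value in pivot_values:
        # Output column is named 'pivot name'_'pivot value', as in the docs.
        col_name = "{0}_{1}".format(pivot_col, value)
        cases.append(
            '{agg}(CASE WHEN {piv} = {v} THEN {val} END) AS "{name}"'.format(
                agg=aggregate, piv=pivot_col, v=value,
                val=value_col, name=col_name))
    return ("CREATE TABLE {out} AS (SELECT {idx}, {cases} "
            "FROM {src} GROUP BY {idx})").format(
                out=out_table, idx=index,
                cases=", ".join(cases), src=source_table)


print(pivot_query("pivset", "pivout", "id", "piv", "val", [10, 20]))
```

Even a short comment like the one in this docstring would let a reader check each string-building step against the intended result.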




[GitHub] incubator-madlib pull request #47: Feature: Pivot Function

2016-06-21 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/47#discussion_r67943517
  
--- Diff: src/ports/postgres/modules/utilities/pivot.py_in ---
@@ -0,0 +1,201 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Pivoting
+# The goal of the MADlib pivot function is to provide a data summarization 
tool
+# that can do basic OLAP type operations on data stored in one table and 
output
+# the summarized data to a second table.  Typical operations are count, 
average,
+# min, max and standard deviation, however user defined aggregates (UDAs) 
are
+# also be allowed.
+
+# Please refer to the pivot.sql_in file for the documentation
+
+"""
+@file pivot.py_in
+
+"""
+import plpy
+from utilities import _assert
+from utilities import split_quoted_delimited_str
+from utilities import strip_end_quotes
+from validate_args import table_exists
+from validate_args import columns_exist_in_table
+from validate_args import table_is_empty
+from validate_args import _get_table_schema_names
+from validate_args import get_first_schema
+
+m4_changequote(`')
+
+def pivot(schema_madlib, source_table, out_table,
+ index, pivot_cols, pivot_values,
+ aggregate_func, **kwargs):
+"""
+Helper function that can be used to pivot tables
+Args:
+@param source_table The original data table
+@param out_tableThe output table that contains the dummy
+variable columns
+@param indexThe index columns to group by the records 
by
+@param pivot_cols   The columns to pivot the table
+@param pivot_values The value columns to be summarized in the
+pivoted table
+@param aggregate_func   The aggregate function to be applied to the
+values
+
+"""
+indices = split_quoted_delimited_str(index)
+pcols = split_quoted_delimited_str(pivot_cols)
+pvals = split_quoted_delimited_str(pivot_values)
+# aggregate_func = "sum"
+validate_pivot_coding(source_table, out_table, indices, pcols, pvals)
+new_col_names =[]
+sql_list = ["CREATE TABLE " + out_table + " AS (SELECT " + index]
+# Preperation for multiple index, pivot, etc.
+for pcol in pcols:
+for pval in pvals:
+pcol_no_quotes = strip_end_quotes(pcol.strip())
+pval_no_quotes = strip_end_quotes(pval.strip())
+distinct_values = plpy.execute(
+"SELECT {pcol} AS value FROM {source_table} "
+"GROUP BY {pcol} ORDER BY {pcol}".
+format(pcol=pcol, source_table=source_table))
+distinct_values = [strip_end_quotes(item['value']) for item in 
distinct_values]
--- End diff --

I'm not sure what strip_end_quotes does, but I suspect what you want to be 
doing here is using quote_ident(). You could incorporate it into the distinct 
values query:

`SELECT *, quote_ident(column1_values) AS column1_quoted, 
quote_ident(column2_values) AS column2_quoted FROM () d;`
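
A rough sketch of how the distinct-values query could carry the quoted form along, so that no Python-side quote handling is needed. The helper name and the `value`/`value_quoted` aliases are made up for illustration; `quote_ident()` is the standard Postgres function, and the cast to text is an assumption to cover non-text pivot columns.

```python
def distinct_values_query(source_table, pcol):
    """Return SQL that yields each distinct pivot value together with a
    version that is safe to use as an identifier (via quote_ident)."""
    return (
        "SELECT value, quote_ident(value::text) AS value_quoted "
        "FROM (SELECT {pcol} AS value FROM {source_table} "
        "GROUP BY {pcol}) d "
        "ORDER BY value"
    ).format(pcol=pcol, source_table=source_table)
```

The result could then be passed to plpy.execute() in place of the current query, reading item['value_quoted'] wherever a column name is built.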




[GitHub] incubator-madlib pull request #49: Feature: Sessionize funtion - Phase 2

2016-06-21 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/49#discussion_r67937714
  
--- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
@@ -35,41 +36,83 @@ def sessionize(schema_madlib, source_table, 
output_table, partition_expr,
 @param source_table: str, Name of the input table/view
 @param output_table: str, Name of the table to store result
 @param partition_expr: str, Expression to partition (group) the 
input data
-@param time_stamp: str, Column name with time used for 
sessionization calculation
+@param time_stamp: str, The time stamp column name that is used 
for sessionization calculation
 @param max_time: interval, Delta time between subsequent events to 
define a session
-
+@param output_cols: str, list of columns the output table/view 
must contain (default '*'):
+* - all columns in the input table, and a new 
session ID column
+'a,b,c,...' -  a comma separated list of column 
names/expressions to be projected, along with a new session ID column
--- End diff --

Would it be better to accept a generator if someone wanted to list column 
names? That certainly seems more pythonic than a string...
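
To make the idea concrete, a small sketch of a normalizer that accepts either form. The function name is hypothetical, and it is written against Python 3's `str`; under Python 2 the check would presumably use `basestring` instead.

```python
def normalize_output_cols(output_cols="*"):
    """Accept '*', a comma-separated string, or any iterable of column
    names/expressions, and return the text for the SELECT list."""
    if output_cols is None:
        return "*"
    if isinstance(output_cols, str):
        cols = output_cols.strip()
    else:
        # Lists, tuples and generators all end up here.
        cols = ", ".join(str(col).strip() for col in output_cols)
    # An empty string or empty iterable falls back to all columns.
    return cols or "*"
```

Callers that already have the columns in a list would not have to join them by hand, while existing string callers keep working.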




[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

2016-06-02 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/44#discussion_r65610699
  
--- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
@@ -0,0 +1,101 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import plpy
+import string
+
+from control import MinWarning
+from utilities import unique_string, _assert
+from validate_args import get_cols
+from validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
+
+m4_changequote(`')
+
+def sessionize(schema_madlib, source_table, output_table, partition_expr,
+   time_stamp, time_out, **kwargs):
+   """
+   Perform sessionization over a sequence of rows.
+
+   Args:
+   @param schema_madlib: str, Name of the MADlib schema
+   @param source_table: str, Name of the input table/view
+   @param output_table: str, Name of the table to store result
+   @param partition_expr: str, Expression to partition (group) the 
input data
+   @param time_stamp: float, Column name with time used for 
sessionization calculation
+   @param time_out: float, Delta time between subsequent events to 
define a sessions
+   
+   """
+   with MinWarning("error"):
+   _validate(source_table, output_table, partition_expr, 
time_stamp, time_out)
+
+   all_input_cols_str = ', '.join([i.strip() for i in 
get_cols(source_table, schema_madlib)])
+   session_id = 'session_id' if not is_var_valid(source_table, 
'session_id') else unique_string('session_id')
+
+   plpy.execute("""
+   CREATE TABLE {output_table} AS
+   SELECT
+   {all_input_cols_str},
+   CASE WHEN {time_stamp} NOTNULL
+   THEN (SUM(new_event_boundary) 
OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp})) END AS {session_id}
+   FROM (
+   SELECT *, 
+   CASE WHEN {time_stamp} 
NOTNULL and ({time_stamp}-LAG({time_stamp},1) OVER (w) > '{time_out}' OR 
ROW_NUMBER() OVER (w) = '1')
--- End diff --

> I think we can just mandate the {time_out} and {min_time} parameters to
> be of type
> interval, and of course, cast it to interval in the query.

Just to clarify my original intent, I was thinking of something like:

try:
   float(time_out)
   time_out_is_number = True
except ...:
   time_out_is_number = False

But I agree that just passing it directly to Postgres and letting it 
cast it to an interval is the simplest and best approach.
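
For the record, a slightly fuller sketch of that check. It assumes a bare number means seconds and anything else is handed to Postgres as an interval literal; the function name is made up, and special float values such as 'nan' or 'inf' are not handled.

```python
def time_out_to_interval_sql(time_out):
    """Return a SQL expression that evaluates to an interval."""
    try:
        float(time_out)
    except (TypeError, ValueError):
        # Not a number: quote it and let Postgres cast it to an interval.
        literal = str(time_out).replace("'", "''")
        return "'{0}'::interval".format(literal)
    # A number: interpret it as seconds.
    return "{0} * interval '1 second'".format(time_out)
```

For example, 30 becomes "30 * interval '1 second'" and '10 minutes' becomes "'10 minutes'::interval".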





[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

2016-06-02 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/44#discussion_r65589213
  
--- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
@@ -0,0 +1,101 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import plpy
+import string
+
+from control import MinWarning
+from utilities import unique_string, _assert
+from validate_args import get_cols
+from validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
+
+m4_changequote(`')
+
+def sessionize(schema_madlib, source_table, output_table, partition_expr,
+   time_stamp, time_out, **kwargs):
+   """
+   Perform sessionization over a sequence of rows.
+
+   Args:
+   @param schema_madlib: str, Name of the MADlib schema
+   @param source_table: str, Name of the input table/view
+   @param output_table: str, Name of the table to store result
+   @param partition_expr: str, Expression to partition (group) the 
input data
+   @param time_stamp: float, Column name with time used for 
sessionization calculation
+   @param time_out: float, Delta time between subsequent events to 
define a sessions
+   
+   """
+   with MinWarning("error"):
+   _validate(source_table, output_table, partition_expr, 
time_stamp, time_out)
+
+   all_input_cols_str = ', '.join([i.strip() for i in 
get_cols(source_table, schema_madlib)])
+   session_id = 'session_id' if not is_var_valid(source_table, 
'session_id') else unique_string('session_id')
+
+   plpy.execute("""
+   CREATE TABLE {output_table} AS
+   SELECT
+   {all_input_cols_str},
+   CASE WHEN {time_stamp} NOTNULL
+   THEN (SUM(new_event_boundary) 
OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp})) END AS {session_id}
+   FROM (
+   SELECT *, 
+   CASE WHEN {time_stamp} 
NOTNULL and ({time_stamp}-LAG({time_stamp},1) OVER (w) > '{time_out}' OR 
ROW_NUMBER() OVER (w) = '1')
--- End diff --

> Do you think we should mandate time_out/min_time to always be of type
> interval? Another option is to force it to be of type float
> (representing seconds), then we may have to use epoch() while computing
> the time_stamp diffs.

On the Postgres side, interval is what's wanted.

On the Python side, I could see a use for either of them. float might be 
the simplest, but interval offers options that plain seconds do not, 
because interval tracks seconds, days, and months separately. In this 
specific case I can't see days or months being very useful, but from an 
overall MADlib API standpoint I think it's best if anything that accepts 
a time delta can accept an interval.

Code-wise, I think the best way to handle this is to check whether the 
Python interval parameter is some kind of number. If it is, pass it 
directly to Postgres and multiply it by a 1 second interval, i.e. 
"{time_out} * interval '1 second'". Postgres will do the correct thing 
regardless of which numeric type it is (int vs numeric vs float). 
Otherwise, just assume it's something that can be cast to an interval 
and use "{time_out}::interval".





[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

2016-06-01 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/44#discussion_r65454766
  
--- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
@@ -0,0 +1,101 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import plpy
+import string
+
+from control import MinWarning
+from utilities import unique_string, _assert
+from validate_args import get_cols
+from validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
+
+m4_changequote(`')
+
+def sessionize(schema_madlib, source_table, output_table, partition_expr,
+   time_stamp, time_out, **kwargs):
+   """
+   Perform sessionization over a sequence of rows.
+
+   Args:
+   @param schema_madlib: str, Name of the MADlib schema
+   @param source_table: str, Name of the input table/view
+   @param output_table: str, Name of the table to store result
+   @param partition_expr: str, Expression to partition (group) the 
input data
+   @param time_stamp: float, Column name with time used for 
sessionization calculation
+   @param time_out: float, Delta time between subsequent events to 
define a sessions
+   
+   """
+   with MinWarning("error"):
+   _validate(source_table, output_table, partition_expr, 
time_stamp, time_out)
+
+   all_input_cols_str = ', '.join([i.strip() for i in 
get_cols(source_table, schema_madlib)])
+   session_id = 'session_id' if not is_var_valid(source_table, 
'session_id') else unique_string('session_id')
+
+   plpy.execute("""
+   CREATE TABLE {output_table} AS
+   SELECT
+   {all_input_cols_str},
+   CASE WHEN {time_stamp} NOTNULL
+   THEN (SUM(new_event_boundary) 
OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp})) END AS {session_id}
--- End diff --

> If the min_time solution you have proposed in
> https://issues.apache.org/jira/browse/MADLIB-1002 works fine, do you
> think we should merge the sessionization phases 1 and 3?

Seems logical to me, but I'm not the one doing the work... ;)





[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

2016-06-01 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/44#discussion_r65437304
  
--- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
@@ -0,0 +1,101 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import plpy
+import string
+
+from control import MinWarning
+from utilities import unique_string, _assert
+from validate_args import get_cols
+from validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
+
+m4_changequote(`')
+
+def sessionize(schema_madlib, source_table, output_table, partition_expr,
+   time_stamp, time_out, **kwargs):
+   """
+   Perform sessionization over a sequence of rows.
+
+   Args:
+   @param schema_madlib: str, Name of the MADlib schema
+   @param source_table: str, Name of the input table/view
+   @param output_table: str, Name of the table to store result
+   @param partition_expr: str, Expression to partition (group) the 
input data
+   @param time_stamp: float, Column name with time used for 
sessionization calculation
+   @param time_out: float, Delta time between subsequent events to 
define a sessions
+   
+   """
+   with MinWarning("error"):
+   _validate(source_table, output_table, partition_expr, 
time_stamp, time_out)
+
+   all_input_cols_str = ', '.join([i.strip() for i in 
get_cols(source_table, schema_madlib)])
+   session_id = 'session_id' if not is_var_valid(source_table, 
'session_id') else unique_string('session_id')
+
+   plpy.execute("""
+   CREATE TABLE {output_table} AS
+   SELECT
+   {all_input_cols_str},
+   CASE WHEN {time_stamp} NOTNULL
+   THEN (SUM(new_event_boundary) 
OVER (PARTITION BY {partition_expr} ORDER BY {time_stamp})) END AS {session_id}
--- End diff --

I did some refactoring of this while taking a look at 
https://issues.apache.org/jira/browse/MADLIB-1002. Please take a look at it, as 
I think it's clearer than this code. *Note that I have not tested it for 
correctness!*




[GitHub] incubator-madlib pull request #44: Feature: Sessionize funtion

2016-06-01 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/44#discussion_r65424347
  
--- Diff: src/ports/postgres/modules/utilities/sessionize.py_in ---
@@ -0,0 +1,101 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import plpy
+import string
+
+from control import MinWarning
+from utilities import unique_string, _assert
+from validate_args import get_cols
+from validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
+
+m4_changequote(`')
+
+def sessionize(schema_madlib, source_table, output_table, partition_expr,
+   time_stamp, time_out, **kwargs):
+   """
+   Perform sessionization over a sequence of rows.
+
+   Args:
+   @param schema_madlib: str, Name of the MADlib schema
+   @param source_table: str, Name of the input table/view
+   @param output_table: str, Name of the table to store result
+   @param partition_expr: str, Expression to partition (group) the 
input data
+   @param time_stamp: float, Column name with time used for 
sessionization calculation
+   @param time_out: float, Delta time between subsequent events to 
define a sessions
+   
+   """
+   with MinWarning("error"):
+   _validate(source_table, output_table, partition_expr, 
time_stamp, time_out)
+
+   all_input_cols_str = ', '.join([i.strip() for i in 
get_cols(source_table, schema_madlib)])
+   session_id = 'session_id' if not is_var_valid(source_table, 
'session_id') else unique_string('session_id')
+
+   plpy.execute("""
+   CREATE TABLE {output_table} AS
+   SELECT
+   {all_input_cols_str},
+   CASE WHEN {time_stamp} NOTNULL
--- End diff --

Is it necessary to include rows with NULL timestamps? While that's not a 
big deal here, I think it's going to lead to unwanted complexity down the road 
(I'm looking at JIRA MADLIB-1002).




[GitHub] incubator-madlib pull request: Feature: Sessionize funtion

2016-05-31 Thread decibel
Github user decibel commented on the pull request:

https://github.com/apache/incubator-madlib/pull/44
  
I'm not sure why params is better than explicit parameters, but let's move 
that discussion to the new Jira.

I didn't do a full review, but the database stuff looks sane.




[GitHub] incubator-madlib pull request: Feature: Sessionize funtion

2016-05-31 Thread decibel
Github user decibel commented on the pull request:

https://github.com/apache/incubator-madlib/pull/44
  
Has consideration been given to allowing for the creation of a view instead 
of a table? If you only needed session info once, or only for a subset of 
values in the input table, that could be significantly faster than 
materializing it.

Related to that... it could also be faster to only materialize the 
partition columns, series number and timespan for each series. You would need 
to join the base data against that, but in some cases it could be faster. A 
variation on this would be to specify the exact columns you want materialized.

So perhaps add "create_view = False, materialize_columns = None" 
parameters, where materialize_columns == None means materialize the whole 
shebang, while materialize_columns = [] means only materialize the partition 
clause, session and timestamp.
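
A sketch of how those two parameters might drive the generated statement. Everything here is hypothetical: the function name, the `session_query` argument (standing in for the inner query that already computes the session ID), and the decision to always keep the partition, timestamp and session columns when a column list is given.

```python
def build_sessionize_statement(output_name, session_query, partition_expr,
                               time_stamp, session_id,
                               create_view=False, materialize_columns=None):
    """Return the CREATE TABLE/VIEW statement for the sessionized output."""
    if materialize_columns is None:
        # None: materialize the whole shebang.
        select_list = "*"
    else:
        # [] or a list: partition clause, timestamp and session ID,
        # plus any explicitly requested columns.
        select_list = ", ".join(
            [partition_expr, time_stamp, session_id]
            + list(materialize_columns))
    relation_kind = "VIEW" if create_view else "TABLE"
    return "CREATE {kind} {name} AS SELECT {cols} FROM ({query}) q".format(
        kind=relation_kind, name=output_name, cols=select_list,
        query=session_query)
```

Note that `materialize_columns` only has a storage meaning for tables; for a view it would simply restrict the projected columns.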




[GitHub] incubator-madlib pull request: Prediction Metrics: New module

2016-05-03 Thread decibel
Github user decibel commented on the pull request:

https://github.com/apache/incubator-madlib/pull/41#issuecomment-216655743
  
I suggest starting with documentation and holding off a bit on the code. 
There might be some even better ways to do things than what I initially 
thought of; I'm hoping that docs will help me figure that out.





[GitHub] incubator-madlib pull request: Prediction Metrics: New module

2016-05-03 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/41#discussion_r61940732
  
--- Diff: src/ports/postgres/modules/pred_metrics/pred_metrics.py_in ---
@@ -0,0 +1,391 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from utilities.utilities import unique_string
+import plpy
+
+def mean_abs_error(
+schema_madlib, table_in, table_out, prediction_col, observed_col,
+grouping_cols=None):
+sql_st1 = """
+CREATE TABLE {table_out} AS
+SELECT
+SUM(ABS({prediction_col}- {observed_col}))
+/COUNT(*) AS mean_abs_error """.format(**locals())
+sql_st2= ""
+sql_st3= """ FROM {table_in} """.format(**locals())
+sql_st4= ""
+if grouping_cols:
+sql_st2= """ , {grouping_cols} """.format(**locals())
+sql_st4= """ GROUP BY {grouping_cols}""".format(**locals())
+sql_st = sql_st1+sql_st2+sql_st3+sql_st4
+plpy.execute(sql_st)
+
+def mean_abs_perc_error(
+schema_madlib, table_in, table_out, prediction_col, observed_col,
+grouping_cols=None):
+sql_st1 = """
+CREATE TABLE {table_out} AS
+SELECT
+SUM(ABS({prediction_col}- {observed_col})/{observed_col})
+/COUNT(*) AS mean_abs_perc_error """.format(**locals())
+sql_st2= ""
+sql_st3= """ FROM {table_in} """.format(**locals())
+sql_st4= ""
+if grouping_cols:
+sql_st2= """ , {grouping_cols} """.format(**locals())
+sql_st4= """ GROUP BY {grouping_cols}""".format(**locals())
+sql_st = sql_st1+sql_st2+sql_st3+sql_st4
+plpy.execute(sql_st)
+
+def mean_perc_error(
+   schema_madlib, table_in, table_out, prediction_col, observed_col,
+grouping_cols=None):
+sql_st1 = """
+CREATE TABLE {table_out} AS
+SELECT
+SUM(({prediction_col}- {observed_col})/{observed_col})
+/COUNT(*) AS mean_perc_error """.format(**locals())
+sql_st2= ""
+sql_st3= """ FROM {table_in} """.format(**locals())
+sql_st4= ""
+if grouping_cols:
+sql_st2= """ , {grouping_cols} """.format(**locals())
+sql_st4= """ GROUP BY {grouping_cols}""".format(**locals())
+sql_st = sql_st1+sql_st2+sql_st3+sql_st4
+plpy.execute(sql_st)
+
+def mean_squared_error(
+   schema_madlib, table_in, table_out, prediction_col, observed_col,
+grouping_cols=None):
+sql_st1 = """
+CREATE TABLE {table_out} AS
+SELECT
+SUM(({prediction_col}- {observed_col})^2)
+/COUNT(*) AS mean_squared_error """.format(**locals())
+sql_st2= ""
+sql_st3= """ FROM {table_in} """.format(**locals())
+sql_st4= ""
+if grouping_cols:
+sql_st2= """ , {grouping_cols} """.format(**locals())
+sql_st4= """ GROUP BY {grouping_cols}""".format(**locals())
+sql_st = sql_st1+sql_st2+sql_st3+sql_st4
+plpy.execute(sql_st)
+
+
+def __r2_score(
--- End diff --

This scans {table_in} 3 times for no good reason. It would be better to use 
the query below, but I'm not sure that's even necessary. Do any of the 
functions at 
http://www.postgresql.org/docs/9.5/static/functions-aggregate.html#FUNCTIONS-AGGREGATE-STATISTICS-TABLE
 do what's necessary here?

CREATE TABLE {table_out} AS
SELECT 1 - ssres/sstot AS r2_score FROM (
    SELECT sum(({prediction_col} - {observed_col})^2) AS ssres,
           sum(({observed_col}
                - (SELECT SUM({observed_col})/count(*) AS mean FROM {table_in})
               )^2) AS sstot
    FROM {table_in}
) intermediate
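
On the built-in aggregates question, one possible single-scan form relies on the identity sstot = n * var_pop(observed). The sketch below checks that identity in plain Python and builds the corresponding SQL. The function names are made up, and it assumes neither column contains NULLs; `var_pop()` is the standard Postgres population-variance aggregate.

```python
def r2_two_pass(predicted, observed):
    """Textbook definition: needs the mean before computing sstot."""
    mean = sum(observed) / float(len(observed))
    ssres = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    sstot = sum((o - mean) ** 2 for o in observed)
    return 1 - ssres / sstot


def r2_single_pass(predicted, observed):
    """Same value from one pass, using sstot = n * var_pop(observed)."""
    n = ssres = total = total_sq = 0.0
    for p, o in zip(predicted, observed):
        n += 1
        ssres += (p - o) ** 2
        total += o
        total_sq += o * o
    var_pop = total_sq / n - (total / n) ** 2
    return 1 - ssres / (n * var_pop)


def r2_score_sql(table_in, table_out, prediction_col, observed_col):
    """SQL counterpart of r2_single_pass (assumes no NULLs in either column)."""
    return (
        "CREATE TABLE {table_out} AS "
        "SELECT 1 - sum(({pred} - {obs})^2) "
        "/ (count({obs}) * var_pop({obs})) AS r2_score "
        "FROM {table_in}"
    ).format(table_out=table_out, pred=prediction_col,
             obs=observed_col, table_in=table_in)
```

This form would also extend naturally to a GROUP BY on the grouping columns, since no scalar subquery for the mean is needed.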




[GitHub] incubator-madlib pull request: Prediction Metrics: New module

2016-05-03 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/41#discussion_r61935728
  
--- Diff: src/ports/postgres/modules/pred_metrics/pred_metrics.py_in ---
@@ -0,0 +1,391 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from utilities.utilities import unique_string
+import plpy
+
+def mean_abs_error(
+schema_madlib, table_in, table_out, prediction_col, observed_col,
+grouping_cols=None):
+sql_st1 = """
+CREATE TABLE {table_out} AS
+SELECT
+SUM(ABS({prediction_col}- {observed_col}))
+/COUNT(*) AS mean_abs_error """.format(**locals())
+sql_st2= ""
+sql_st3= """ FROM {table_in} """.format(**locals())
+sql_st4= ""
+if grouping_cols:
+sql_st2= """ , {grouping_cols} """.format(**locals())
--- End diff --

FWIW, normal convention is to put grouping columns first in the select 
clause.
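
For example, the statement could be assembled with the grouping columns leading the select list. This is only a sketch that returns the SQL text; the function name is hypothetical, and the metric expression is the one from the diff.

```python
def mean_abs_error_sql(table_in, table_out, prediction_col, observed_col,
                       grouping_cols=None):
    """Build the mean absolute error statement, grouping columns first."""
    select_items = []
    if grouping_cols:
        select_items.append(grouping_cols)
    select_items.append(
        "SUM(ABS({0} - {1})) / COUNT(*) AS mean_abs_error".format(
            prediction_col, observed_col))
    sql = "CREATE TABLE {0} AS SELECT {1} FROM {2}".format(
        table_out, ", ".join(select_items), table_in)
    if grouping_cols:
        sql += " GROUP BY {0}".format(grouping_cols)
    return sql
```

Building the select list as a Python list also removes the need for the empty sql_st2/sql_st4 placeholders.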




[GitHub] incubator-madlib pull request: Elastic Net: Skip arrays with NULL ...

2016-04-07 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/35#discussion_r58973285
  
--- Diff: src/ports/postgres/modules/convex/utils_regularization.py_in ---
@@ -47,14 +46,22 @@ def __utils_dep_var_scale(**kwargs):
 The output will be stored in a temp table: a mean array and a std array
 
 This function is also used in lasso.
+
+Parameters:
+schema_madlib -- madlib schema
+tbl_data -- original data
+col_ind_var -- independent variables column
+col_dep_var -- dependent variable column
 """
+
 y_scale = plpy.execute(
 """
 select
-avg({col_dep_var}) as mean,
+avg(case when not 
{schema_madlib}.array_contains_null({col_ind_var}) then {col_dep_var} end) as 
mean,
 1 as std
--- End diff --

That's not very defensive programming (assuming that the table column will 
always be NOT NULL).

BUT... I don't think that's necessary anyway. avg() should correctly handle 
NULLs:

```SQL
SELECT avg(unnest) FROM unnest('{1,2,NULL,NULL,NULL}'::int[]);
avg 

 1.5000
(1 row)
```
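
The same behaviour can be checked from Python without a PostgreSQL server. This is only a sketch: it uses the standard-library `sqlite3` module as a stand-in, on the assumption that SQLite's `avg()` skips NULLs the same way PostgreSQL's does for this case. The function name `average_ignoring_nulls` is made up for the example.

```python
import sqlite3


def average_ignoring_nulls(values):
    """Return SQL avg() over the values; None entries are stored as NULL.

    Uses an in-memory SQLite database as a stand-in for PostgreSQL.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.executemany("INSERT INTO t (x) VALUES (?)",
                         [(v,) for v in values])
        # avg() skips NULL rows, so only the non-NULL values are averaged.
        return conn.execute("SELECT avg(x) FROM t").fetchone()[0]
    finally:
        conn.close()
```

Calling `average_ignoring_nulls([1, 2, None, None, None])` averages only the two non-NULL values. If every value is NULL the result is `None`, since there is nothing to average.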




[GitHub] incubator-madlib pull request: Path: Return results for each match

2016-03-24 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/29#discussion_r57410192
  
--- Diff: src/ports/postgres/modules/utilities/path.py_in ---
@@ -108,8 +107,11 @@ def path(schema_madlib, source_table, output_table, partition_expr,
 *,
 nextval('{seq_gen}') AS {id_col_name},
--- End diff --

What are you basing that on? I don't believe there's any significant 
performance benefit to sequences over row_number() (actually, a quick glance at 
the code leads me to think sequences will be slower). On top of that, creating 
a sequence adds to catalog bloat (not to mention the cost of creating and then 
dropping the sequence).
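
For comparison, a rough sketch of the `row_number()` alternative. The helper `numbered_rows_sql` and its `order_expr` parameter are hypothetical, and whether a window function fits here depends on how the rest of `path()` uses the id column, so treat this as an assumption rather than a drop-in change. It only builds the query string.

```python
def numbered_rows_sql(source_table, id_col_name, order_expr=None):
    """Build a SELECT that numbers rows with row_number().

    Hypothetical helper for illustration; no sequence is created or
    dropped, so nothing is added to the catalog.
    """
    if order_expr:
        over_clause = "OVER (ORDER BY {0})".format(order_expr)
    else:
        # Without an ORDER BY the numbering order is unspecified.
        over_clause = "OVER ()"
    return "SELECT *, row_number() {0} AS {1} FROM {2}".format(
        over_clause, id_col_name, source_table)
```

For example, `numbered_rows_sql("src", "__id__", "event_time")` produces a single statement with no separate `CREATE SEQUENCE` / `DROP SEQUENCE` steps around it.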




[GitHub] incubator-madlib pull request: Path: Return results for each match

2016-03-20 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/29#discussion_r56779268
  
--- Diff: src/ports/postgres/modules/utilities/path.py_in ---
@@ -387,11 +395,16 @@ def _parse_symbol_str(symbol_expr, pattern_expr):
 lambda m: old_to_new[re.escape(string.lower(m.group(0)))],
 pattern_expr)
 
-old_sym_def_str = '\n'.join("WHEN {0} THEN '{1}'::text".format(v, k)
-for k, v in old_sym_definitions.items())
-new_sym_def_str = '\n'.join("WHEN {0} THEN '{1}'::text".format(v, k)
-for k, v in new_sym_definitions.items())
-return (new_pattern_expr, old_sym_def_str, new_sym_def_str)
+# build a case statement to search a tuple for each definition and pick the
+# appropriate symbol.
+orig_sym_case_stmt = []
+new_sym_case_stmt = []
+general_case_stmt = "WHEN {0} THEN '{1}'::text"
+for k in orig_symbols_ordered:
+orig_sym_case_stmt.append(general_case_stmt.format(orig_sym_definitions[k], k))
+new_sym_case_stmt.append(general_case_stmt.format(orig_sym_definitions[k],
+  old_to_new[k.lower()]))
+return (new_pattern_expr, '\n'.join(orig_sym_case_stmt), '\n'.join(new_sym_case_stmt))
--- End diff --

I believe the example and returns sections of the function header need to 
be updated.

