[GitHub] madlib pull request #195: Feature: Add grouping support to HITS

2017-11-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/madlib/pull/195


---


[GitHub] madlib pull request #195: Feature: Add grouping support to HITS

2017-11-20 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/195#discussion_r152136163
  
--- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
@@ -709,16 +709,35 @@ def _check_groups(tbl1, tbl2, grp_list):
 return ' AND '.join([" {tbl1}.{i} = {tbl2}.{i} ".format(**locals())
  for i in grp_list])
 
-
-def _grp_from_table(tbl, grp_list):
-"""
-Helper function for selecting grouping columns of a table
+def get_filtered_cols_subquery_str(include_from_table, exclude_from_table,
+   filter_cols_list):
+"""
+This function returns a subquery string with columns in the 
filter_cols_list
--- End diff --

I'm confused with this description string. My understanding is that the 
function doesn't really filter the columns - it filters the values for the 
columns i.e. it returns a subquery string that would filter values in 
exclude_from_table. If you're interested in filtering columns, then wouldn't 
querying the catalog and then doing a set difference be the better option? 


---


[GitHub] madlib pull request #195: Feature: Add grouping support to HITS

2017-11-16 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/195#discussion_r151526779
  
--- Diff: src/ports/postgres/modules/graph/hits.py_in ---
@@ -95,234 +109,391 @@ def hits(schema_madlib, vertex_table, vertex_id, 
edge_table, edge_args,
 FROM {1}
 """.format(vertex_id, vertex_table))[0]["cnt"]
 
-# Assign default threshold value based on number of nodes in the 
graph.
 if threshold is None:
-threshold = 1.0 / (n_vertices * 1000)
+threshold = 
get_default_threshold_for_centrality_measures(n_vertices)
 
+# table/column names used when grouping_cols is set.
 edge_temp_table = unique_string(desp='temp_edge')
+grouping_cols_comma = grouping_cols + ',' if grouping_cols else ''
 distribution = ('' if is_platform_pg() else
-"DISTRIBUTED BY ({0})".format(dest))
-plpy.execute("DROP TABLE IF EXISTS {0}".format(edge_temp_table))
+"DISTRIBUTED BY ({0}{1})".format(
+grouping_cols_comma, dest))
+drop_tables([edge_temp_table])
 plpy.execute("""
 CREATE TEMP TABLE {edge_temp_table} AS
 SELECT * FROM {edge_table}
 {distribution}
 """.format(**locals()))
 
-# GPDB and HAWQ have distributed by clauses to help them with 
indexing.
-# For Postgres we add the index explicitly.
-if is_platform_pg():
-plpy.execute("CREATE INDEX ON {0}({1})".format(
-edge_temp_table, dest))
+
##
+# Set initial authority_norm and hub_norm as 1, so that later the 
final
+# norm should be positive number
+authority_init_value = 1.0
+hub_init_value = 1.0
+
+subquery1 = unique_string(desp='subquery1')
+
+distinct_grp_table = ''
+select_grouping_cols_comma = ''
+select_subquery1_grouping_cols_comma = ''
--- End diff --

My 2 cents: 
1. Descriptive names are definitely preferred. Here, however, it's not 
clear how `grouping_cols_comma` differs from `select_grouping_cols_comma` which 
differs from `select_subquery1_grouping_cols_comma` just by looking at the 
values.  I prefer for the variable name should indicate *what* the value 
represents rather than *where* it's used. 
2. My preference is to have an else clause that defines these empty 
variables. That makes it clear that the variables can have two possible values: 
one for grouping and the empty for non-grouping. 


---


[GitHub] madlib pull request #195: Feature: Add grouping support to HITS

2017-11-15 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/195#discussion_r151226712
  
--- Diff: src/ports/postgres/modules/graph/hits.py_in ---
@@ -95,234 +109,391 @@ def hits(schema_madlib, vertex_table, vertex_id, 
edge_table, edge_args,
 FROM {1}
 """.format(vertex_id, vertex_table))[0]["cnt"]
 
-# Assign default threshold value based on number of nodes in the 
graph.
 if threshold is None:
-threshold = 1.0 / (n_vertices * 1000)
+threshold = 
get_default_threshold_for_centrality_measures(n_vertices)
 
+# table/column names used when grouping_cols is set.
 edge_temp_table = unique_string(desp='temp_edge')
+grouping_cols_comma = grouping_cols + ',' if grouping_cols else ''
 distribution = ('' if is_platform_pg() else
-"DISTRIBUTED BY ({0})".format(dest))
-plpy.execute("DROP TABLE IF EXISTS {0}".format(edge_temp_table))
+"DISTRIBUTED BY ({0}{1})".format(
+grouping_cols_comma, dest))
+drop_tables([edge_temp_table])
 plpy.execute("""
 CREATE TEMP TABLE {edge_temp_table} AS
 SELECT * FROM {edge_table}
 {distribution}
 """.format(**locals()))
 
-# GPDB and HAWQ have distributed by clauses to help them with 
indexing.
-# For Postgres we add the index explicitly.
-if is_platform_pg():
-plpy.execute("CREATE INDEX ON {0}({1})".format(
-edge_temp_table, dest))
+
##
+# Set initial authority_norm and hub_norm as 1, so that later the 
final
+# norm should be positive number
+authority_init_value = 1.0
+hub_init_value = 1.0
+
+subquery1 = unique_string(desp='subquery1')
+
+distinct_grp_table = ''
+select_grouping_cols_comma = ''
+select_subquery1_grouping_cols_comma = ''
--- End diff --

While this may not be perfect, we felt that it is slightly better than 
having short variable names that do not convey much meaning at all. 


---


[GitHub] madlib pull request #195: Feature: Add grouping support to HITS

2017-11-15 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/195#discussion_r151226056
  
--- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
@@ -709,16 +709,35 @@ def _check_groups(tbl1, tbl2, grp_list):
 return ' AND '.join([" {tbl1}.{i} = {tbl2}.{i} ".format(**locals())
  for i in grp_list])
 
-
-def _grp_from_table(tbl, grp_list):
-"""
-Helper function for selecting grouping columns of a table
+def get_filtered_cols_subquery_str(include_from_table, exclude_from_table,
--- End diff --

we removed the _ because get_filtered_cols_subquery_str is not for internal 
use, it is called by pagerank and the python files.


---


[GitHub] madlib pull request #195: Feature: Add grouping support to HITS

2017-11-15 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/195#discussion_r151225853
  
--- Diff: src/ports/postgres/modules/graph/hits.sql_in ---
@@ -164,10 +168,20 @@ INSERT INTO edge VALUES
 (3, 0, 1),
 (4, 0, 1),
 (5, 6, 1),
-(6, 3, 1);
+(6, 3, 1),
+(0, 1, 2),
+(0, 2, 2),
+(0, 4, 2),
+(1, 2, 2),
+(1, 3, 2),
+(2, 3, 2),
+(3, 0, 2),
+(4, 0, 2),
+(5, 6, 2),
+(6, 3, 2);
 
 
--# Compute the HITS scores:
+-# Running HITS with default values for optional parameters:
--- End diff --

@jingyimei good point. we checked in a commit to remove the extra user_id 
`2` from the edge table when hits is called without the grouping column.  


---


[GitHub] madlib pull request #195: Feature: Add grouping support to HITS

2017-11-15 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/195#discussion_r151063100
  
--- Diff: src/ports/postgres/modules/graph/graph_utils.py_in ---
@@ -109,6 +110,85 @@ def validate_graph_coding(vertex_table, vertex_id, 
edge_table, edge_params,
 
 return None
 
+def validate_params_for_centrality_measures(schema_madlib, func_name,
--- End diff --

+1


---


[GitHub] madlib pull request #195: Feature: Add grouping support to HITS

2017-11-15 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/195#discussion_r151064800
  
--- Diff: src/ports/postgres/modules/graph/hits.py_in ---
@@ -95,234 +109,391 @@ def hits(schema_madlib, vertex_table, vertex_id, 
edge_table, edge_args,
 FROM {1}
 """.format(vertex_id, vertex_table))[0]["cnt"]
 
-# Assign default threshold value based on number of nodes in the 
graph.
 if threshold is None:
-threshold = 1.0 / (n_vertices * 1000)
+threshold = 
get_default_threshold_for_centrality_measures(n_vertices)
 
+# table/column names used when grouping_cols is set.
 edge_temp_table = unique_string(desp='temp_edge')
+grouping_cols_comma = grouping_cols + ',' if grouping_cols else ''
 distribution = ('' if is_platform_pg() else
-"DISTRIBUTED BY ({0})".format(dest))
-plpy.execute("DROP TABLE IF EXISTS {0}".format(edge_temp_table))
+"DISTRIBUTED BY ({0}{1})".format(
+grouping_cols_comma, dest))
+drop_tables([edge_temp_table])
 plpy.execute("""
 CREATE TEMP TABLE {edge_temp_table} AS
 SELECT * FROM {edge_table}
 {distribution}
 """.format(**locals()))
 
-# GPDB and HAWQ have distributed by clauses to help them with 
indexing.
-# For Postgres we add the index explicitly.
-if is_platform_pg():
-plpy.execute("CREATE INDEX ON {0}({1})".format(
-edge_temp_table, dest))
+
##
+# Set initial authority_norm and hub_norm as 1, so that later the 
final
+# norm should be positive number
+authority_init_value = 1.0
+hub_init_value = 1.0
+
+subquery1 = unique_string(desp='subquery1')
+
+distinct_grp_table = ''
+select_grouping_cols_comma = ''
+select_subquery1_grouping_cols_comma = ''
--- End diff --

I understood the incentive to rename those variables to make them more 
meaningful, however, seems this doesn't help simplify/reduce code but extends 
the length of many lines. When I read the code I still feel hard to keep track 
of them and understand the query. In this case I feel a shorter variable name 
is even simpler to read.


---


[GitHub] madlib pull request #195: Feature: Add grouping support to HITS

2017-11-15 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/195#discussion_r151061922
  
--- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
@@ -709,16 +709,35 @@ def _check_groups(tbl1, tbl2, grp_list):
 return ' AND '.join([" {tbl1}.{i} = {tbl2}.{i} ".format(**locals())
  for i in grp_list])
 
-
-def _grp_from_table(tbl, grp_list):
-"""
-Helper function for selecting grouping columns of a table
+def get_filtered_cols_subquery_str(include_from_table, exclude_from_table,
--- End diff --

what is our consideration to define function name not start with 
underscore(weak internal use) while the previous one is a weak internal use?


---


[GitHub] madlib pull request #195: Feature: Add grouping support to HITS

2017-11-13 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/195#discussion_r150638553
  
--- Diff: src/ports/postgres/modules/utilities/validate_args.py_in ---
@@ -262,6 +262,13 @@ def get_first_schema(table_name):
 return None
 # -
 
+def drop_tables(table_list):
+"""
+Drop tables specified in table_list.
+"""
+drop_str = ', '.join([table for table in table_list])
--- End diff --

Why not just `', '.join(table_list)`?


---


[GitHub] madlib pull request #195: Feature: Add grouping support to HITS

2017-11-13 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/195#discussion_r150637914
  
--- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
@@ -709,6 +709,17 @@ def _check_groups(tbl1, tbl2, grp_list):
 return ' AND '.join([" {tbl1}.{i} = {tbl2}.{i} ".format(**locals())
  for i in grp_list])
 
+def get_ignore_groups(first_table, second_table, grouping_cols_list):
--- End diff --

This function (and the `_grp_from_table`) are difficult to understand just 
from the names (for eg. what does `first_table` and `second_table` impy?). 
Further, these are returning specific SQL expressions - is there value in 
putting these in a general `utilities` module? If yes, then I would suggest 
making the use case more general. 
The `_grp_from_table` especially seems like extra work for a one-liner, 
without getting benefits of abstraction. 


---


[GitHub] madlib pull request #195: Feature: Add grouping support to HITS

2017-11-13 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/195#discussion_r150629090
  
--- Diff: src/ports/postgres/modules/graph/graph_utils.py_in ---
@@ -109,6 +110,85 @@ def validate_graph_coding(vertex_table, vertex_id, 
edge_table, edge_params,
 
 return None
 
+def validate_params_for_centrality_measures(schema_madlib, func_name,
--- End diff --

This function name is a bit confusing since it isn't used by the centrality 
measures functions from `measures.py_in`


---


[GitHub] madlib pull request #195: Feature: Add grouping support to HITS

2017-11-13 Thread orhankislal
Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/madlib/pull/195#discussion_r150629575
  
--- Diff: src/ports/postgres/modules/graph/graph_utils.py_in ---
@@ -109,6 +110,85 @@ def validate_graph_coding(vertex_table, vertex_id, 
edge_table, edge_params,
 
 return None
 
+def validate_params_for_centrality_measures(schema_madlib, func_name,
+threshold, max_iter,
+edge_table=None,
+grouping_cols_list=None):
+_assert(not threshold or (threshold >= 0.0 and threshold <= 1.0),
+"{0}: Invalid threshold value ({1}), must be between 0 and 1.".
+format(func_name, threshold))
+_assert(max_iter > 0,
+"""{0}: Invalid max_iter value ({1}), must be a positive 
integer.""".
+format(func_name, max_iter))
+if grouping_cols_list:
+# validate the grouping columns. We currently only support 
grouping_cols
+# to be column names in the edge_table, and not expressions!
+_assert(columns_exist_in_table(edge_table, grouping_cols_list, 
schema_madlib),
+"{0} error: One or more grouping columns specified do not 
exist!".
+format(func_name))
+
+def 
update_output_grouping_tables_for_centrality_measures(temp_summary_table,
--- End diff --

Maybe a shorter name will help us keeping the lines under 80 characters.


---


[GitHub] madlib pull request #195: Feature: Add grouping support to HITS

2017-11-06 Thread kaknikhil
GitHub user kaknikhil opened a pull request:

https://github.com/apache/madlib/pull/195

 Feature: Add grouping support to HITS 

JIRA: MADLIB-1151

Changes to support grouping column in HITS. Update queries to
use group by and other necessary sql constructs.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kaknikhil/madlib hits_grouping

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/195.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #195


commit 9b47a4345f9efd60be65f5adf3a884e6c660050c
Author: Jingyi Mei 
Date:   2017-10-05T19:18:32Z

Feature: Add grouping support to HITS

JIRA: MADLIB-1151

Changes to support grouping column in HITS. Update queries to
use group by and other necessary sql constructs.

commit 5f88d0553d85251575eb4919d2d69b89b2fec162
Author: Nikhil Kak 
Date:   2017-10-30T23:01:34Z

Add install check test for hits grouping

- Add tests for calling hits with grouping cols
- Improve error messages if the result from hits_out does not meet
  expectations

commit c7ebe372fe3c0038ee47cfe9245d39d32c1ab95b
Author: Nikhil Kak 
Date:   2017-11-02T17:20:02Z

Update HITS examples and usage with grouping columns

commit 81b03cf88598627a2df1bad3b57302f6b2a0aa9d
Author: Nikhil Kak 
Date:   2017-11-02T22:41:18Z

Add user docs for grouping column

Also adds a few more examples for threshold and max_iter.

commit 5267d272150fb6071f2604a6516d1bf4a3db05d1
Author: Nikhil Kak 
Date:   2017-11-03T18:31:38Z

Refactor code, add comments, change var names.

- Refactor logic into functions
- This changes some code in pagerank, since we pulled out some
functions out of HITS code which can be used in pagerank too.
- Refactor indentation with PEP8 standards




---