[
https://issues.apache.org/jira/browse/MADLIB-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140769#comment-16140769
]
Jingyi Mei commented on MADLIB-1124:
------------------------------------
Here is the interface I plan to implement. Please comment on this proposal:
{code}
hits( vertex_table,
vertex_id,
edge_table,
edge_args,
out_table,
max_iter,
threshold,
grouping_cols
)
{code}
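The 'edge_args' argument in the signature above is a comma-delimited string of "name=value" pairs, with 'src' and 'dest' as the supported names and defaults. To make that format concrete, here is a minimal Python sketch of how such a string could be parsed. The function name parse_edge_args and the choice to reject unknown names are assumptions made for illustration; this is not MADlib code.

```python
def parse_edge_args(edge_args):
    """Parse a string such as "src=from_id, dest=to_id" into a dict.

    Hypothetical helper: missing names fall back to the proposed defaults.
    """
    params = {"src": "src", "dest": "dest"}  # defaults from the proposal
    if not edge_args:
        return params
    for item in edge_args.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas
        name, sep, value = item.partition("=")
        name = name.strip().lower()
        value = value.strip()
        if not sep or not value:
            raise ValueError("expected name=value, got %r" % item)
        if name not in params:
            # Assumption: unknown names are an error rather than ignored.
            raise ValueError("unsupported edge_args name %r" % name)
        params[name] = value
    return params
```

For example, parse_edge_args("dest=to_id") keeps the default source column name and overrides only the destination column name.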
Arguments
{code}
vertex_table
TEXT. Name of the table containing the vertex data for the graph. Must contain
the column specified in the 'vertex_id' parameter below.
vertex_id
TEXT, default = 'id'. Name of the column in 'vertex_table' containing vertex
ids. The vertex ids are of type INTEGER with no duplicates. They do not need to
be contiguous.
edge_table
TEXT. Name of the table containing the edge data. The edge table must contain
columns for source vertex and destination vertex.
edge_args
TEXT. A comma-delimited string containing multiple named arguments of the form
"name=value". The following parameters are supported for this string argument:
src (INTEGER): Name of the column containing the source vertex ids in the edge
table. Default column name is 'src'.
dest (INTEGER): Name of the column containing the destination vertex ids in the
edge table. Default column name is 'dest'.
out_table
TEXT. Name of the table to store the result of HITS. It will contain a row for
every vertex from 'vertex_table' with the following columns:
vertex_id : The id of a vertex. Will use the input parameter 'vertex_id' for
column naming.
authority : The vertex's Authority score.
hub : The vertex's Hub score.
grouping_cols : Grouping column (if any) values associated with the vertex_id.
A summary table is also created that contains information regarding the number
of iterations required for convergence. It is named by adding the suffix
'_summary' to the 'out_table' parameter.
max_iter
INTEGER, default: 100. The maximum number of iterations allowed. An iteration
consists of both Authority and Hub phases.
threshold
FLOAT8, default: (1/number of vertices * 100). If the difference between the
values of both scores (Authority and Hub) for every vertex of two consecutive
iterations is smaller than 'threshold', or the iteration number is larger than
'max_iter', the computation stops. If you set the threshold to zero, then you
will force the algorithm to run for the full number of iterations specified in
'max_iter'. It is advisable to set threshold to a value lower than 1 since both
values (Authority and Hub) of nodes are initialized to 1. Note that both the
Authority and Hub value differences must be below threshold for the algorithm
to stop.
grouping_cols (optional)
TEXT, default: NULL. A single column or a list of comma-separated columns that
divides the input data into discrete groups, resulting in one distribution per
group. When this value is NULL, no grouping is used and a single model is
generated for all data.
Note
Expressions are not currently supported for 'grouping_cols'.
The grouping support will be added later.
{code}
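To illustrate how 'max_iter' and 'threshold' interact as described above, here is a minimal in-memory Python sketch of the iteration. It is not the planned SQL implementation. The name hits_sketch, the L2 normalization after each phase, and the reading of the default threshold as 1 / (number of vertices * 100) are all assumptions; grouping is not covered.

```python
import math


def _l2_normalize(scores):
    # Assumption: scores are scaled to unit L2 norm after each phase.
    norm = math.sqrt(sum(s * s for s in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {v: s / norm for v, s in scores.items()}


def hits_sketch(vertices, edges, max_iter=100, threshold=None):
    """Return (authority, hub, iterations) for (src, dest) edge pairs."""
    vertices = list(vertices)
    if not vertices:
        return {}, {}, 0
    if threshold is None:
        # Assumed reading of the proposed default.
        threshold = 1.0 / (len(vertices) * 100)
    # Both scores start at 1 for every vertex, as in the proposal.
    authority = {v: 1.0 for v in vertices}
    hub = {v: 1.0 for v in vertices}
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # Authority phase: sum the hub scores of incoming neighbours.
        new_authority = {v: 0.0 for v in vertices}
        for src, dest in edges:
            new_authority[dest] += hub[src]
        new_authority = _l2_normalize(new_authority)

        # Hub phase: sum the new authority scores of outgoing neighbours.
        new_hub = {v: 0.0 for v in vertices}
        for src, dest in edges:
            new_hub[src] += new_authority[dest]
        new_hub = _l2_normalize(new_hub)

        authority_diff = max(abs(new_authority[v] - authority[v]) for v in vertices)
        hub_diff = max(abs(new_hub[v] - hub[v]) for v in vertices)
        authority, hub = new_authority, new_hub
        # Stop only when both differences are below the threshold.
        # With threshold = 0 this never holds, so all max_iter iterations run.
        if authority_diff < threshold and hub_diff < threshold:
            break
    return authority, hub, iterations
```

Calling hits_sketch([1, 2, 3], [(1, 2), (1, 3), (2, 3)]) ranks vertex 3 as the strongest authority and vertex 1 as the strongest hub; passing threshold=0 forces the loop to run for exactly max_iter iterations.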
> Graph - HITS algorithm
> ----------------------
>
> Key: MADLIB-1124
> URL: https://issues.apache.org/jira/browse/MADLIB-1124
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Graph
> Reporter: Frank McQuillan
> Assignee: Jingyi Mei
> Fix For: v2.0
>
>