Repository: incubator-madlib Updated Branches: refs/heads/latest_release a3863b6c2 -> 8e2778a39
Graph: - Create generic graph validation and help message to standardize future graph algorithm development. - Expand the design document with more detail on the graph representation as well as the SSSP implementation. Closes #105 Project: http://git-wip-us.apache.org/repos/asf/incubator-madlib/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-madlib/commit/01586c0d Tree: http://git-wip-us.apache.org/repos/asf/incubator-madlib/tree/01586c0d Diff: http://git-wip-us.apache.org/repos/asf/incubator-madlib/diff/01586c0d Branch: refs/heads/latest_release Commit: 01586c0d05761a794efa953c09fa568f27c84cb7 Parents: a3863b6 Author: Orhan Kislal <okis...@pivotal.io> Authored: Mon Mar 13 16:38:34 2017 -0700 Committer: Orhan Kislal <okis...@pivotal.io> Committed: Mon Mar 13 16:38:34 2017 -0700 ---------------------------------------------------------------------- doc/design/figures/graph_example.pdf | Bin 0 -> 23083 bytes doc/design/modules/graph.tex | 208 ++++++++++++++++++- .../postgres/modules/graph/graph_utils.py_in | 107 ++++++++++ .../postgres/modules/graph/graph_utils.sql_in | 0 src/ports/postgres/modules/graph/sssp.py_in | 62 +----- src/ports/postgres/modules/graph/sssp.sql_in | 1 - 6 files changed, 315 insertions(+), 63 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-madlib/blob/01586c0d/doc/design/figures/graph_example.pdf ---------------------------------------------------------------------- diff --git a/doc/design/figures/graph_example.pdf b/doc/design/figures/graph_example.pdf new file mode 100644 index 0000000..fd29e5f Binary files /dev/null and b/doc/design/figures/graph_example.pdf differ http://git-wip-us.apache.org/repos/asf/incubator-madlib/blob/01586c0d/doc/design/modules/graph.tex ---------------------------------------------------------------------- diff --git a/doc/design/modules/graph.tex b/doc/design/modules/graph.tex index 758f407..5c3910c 100644 --- a/doc/design/modules/graph.tex +++ b/doc/design/modules/graph.tex @@ -1,4 +1,5 @@ -% When using TeXShop on the Mac, let it know the root document. The following must be one of the first 20 lines. +% When using TeXShop on the Mac, let it know the root document. The following +% must be one of the first 20 lines. % !TEX root = ../design.tex % Licensed to the Apache Software Foundation (ASF) under one @@ -25,31 +26,99 @@ \item[History] \begin{modulehistory} \item[v0.1] Initial version, SSSP only. + \item[v0.2] Graph Framework, SSSP implementation details. \end{modulehistory} \end{moduleinfo} % Abstract. What is the problem we want to solve? -This module implements various graph algorithms that are used in a number of applications such as social networks, telecommunications and road networks. +This module implements various graph algorithms that are used in a number of +applications such as social networks, telecommunications and road networks. -% \section{Graph Representation} \label{sec:graph:rep} +\section{Graph Framework} \label{sec:graph:fw} + +MADlib graph representation depends on two structures, a \emph{vertex} table +and an \emph{edge} table. The vertex table has to have a column of vertex ids. +The edge table has to have 2 columns: source vertex id, destination vertex id. +For most algorithms an edge weight column is required as well. The +representation assumes a directed graph, an edge from $x$ to $y$ does +\emph{not} guarantee the existence of an edge from $y$ to $x$. Both of the +tables may have additional columns as required. Multi-edges (multiple edges +from a vertex to the same destination) and loops (edge from a vertex to +itself) are allowed. This representation does not impose any ordering of +vertices or edges. An example graph is given in Figure~\ref{sssp:example} and +its representative tables are given in Table~\ref{sssp:rep}. + +\begin{figure}[h] + \centering + \includegraphics[width=0.9\textwidth]{figures/graph_example.pdf} +\caption{A sample graph} +\label{sssp:example} +\end{figure} + +\begin{table} + \begin{tabular}{| c | } + \hline + vid \\ \hline + 0 \\ \hline + 1 \\ \hline + 2 \\ \hline + 3 \\ \hline + 4 \\ \hline + 5 \\ \hline + 6 \\ \hline + 7 \\ + \hline + \end{tabular} + \quad + \begin{tabular}{| c | c | c |} + \hline + src & dest & weight \\ \hline + 0 & 1 & 1 \\ \hline + 0 & 2 & 1 \\ \hline + 0 & 4 & 10 \\ \hline + 1 & 2 & 2 \\ \hline + 1 & 3 & 10 \\ \hline + 1 & 5 & 1 \\ \hline + 2 & 3 & 1 \\ \hline + 2 & 5 & 1 \\ \hline + 2 & 6 & 3 \\ \hline + 3 & 0 & 1 \\ \hline + 5 & 6 & 1 \\ \hline + 6 & 7 & 1 \\ + \hline + \end{tabular} + \caption{Graph representation of vertices (left) and edges(right) in the + database} + \label{sssp:rep} +\end{table} -% Our graph representation depends on two structures, a \emph{vertex} table and an \emph{edge} table. \section{Single Source Shortest Path} \label{sec:graph:sssp} -Given a graph and a source vertex, single source shortest path (SSSP) algorithm finds a path for every vertex such that the sum of the weights of its constituent edges is minimized. +Given a graph and a source vertex, single source shortest path (SSSP) +algorithm finds a path for every vertex such that the sum of the weights of +its constituent edges is minimized. -Shortest path is defined as follows. Let $e_{i,j}$ be the edge from vertex $i$ to vertex $j$ and $w_{i,j}$ be its weight. Given a graph G, the shortest path from $s$ to $d$ is $P = (v_1, v_2 \dots, v_n)$ (where $v_1=s$ and $v_n=d$) that over all possible $n$ minimizes the sum $ \sum _{i=1}^{n-1}f(e_{i,i+1})$. +Shortest path is defined as follows. Let $e_{i,j}$ be the edge from vertex $i$ +to vertex $j$ and $w_{i,j}$ be its weight. Given a graph G, the shortest path +from $s$ to $d$ is $P = (v_1, v_2 \dots, v_n)$ (where $v_1=s$ and $v_n=d$) +that over all possible $n$ minimizes the sum $ \sum _{i=1}^{n-1}f(e_{i,i+1})$. % \subsection{Bellman Ford Algorithm} -Bellman-Ford Algorithm \cite{bellman1958routing,ford1956network} is based on the following idea: We start with a naive approximation for the cost of reaching every vertex. At each iteration, these values are refined based on the edge list and the existing approximations. If there are no refinements at any given step, the algorithm returns the calculated results. If the algorithm does not converge in $|V|-1$ iterations, this indicates the existence of a negative cycle in the graph. +Bellman-Ford Algorithm \cite{bellman1958routing,ford1956network} is based on +the following idea: We start with a naive approximation for the cost of +reaching every vertex. At each iteration, these values are refined based on +the edge list and the existing approximations. If there are no refinements at +any given step, the algorithm returns the calculated results. If the algorithm +does not converge in $|V|-1$ iterations, this indicates the existence of a +negative cycle in the graph. \begin{algorithm}[SSSP$(V,E,start)$] \label{alg:sssp} -\alginput{Vertex set $V$, edge set $E$, starting vertex $start$} +\alginput{Vertex set $v$, edge set $E$, starting vertex $start$} \algoutput{Distance and parent set for every vertex $cur$} \begin{algorithmic}[1] \State $toupdate(0) \set (start,0,start)$ @@ -80,9 +149,126 @@ Bellman-Ford Algorithm \cite{bellman1958routing,ford1956network} is based on the Changes from the standard Bellman-Ford algorithm: \begin{description} -\item Line~\ref{alg:sssp:update}: We only check the vertices that have been updated in the last iteration. -\item Line~\ref{alg:sssp:single}: At each iteration, we update a given vertex only one time. This means the toupdate set cannot contain multiple records for the same vertex which requires the comparison with the existing value. +\item Line~\ref{alg:sssp:update}: We only check the vertices that have been +updated in the last iteration. +\item Line~\ref{alg:sssp:single}: At each iteration, we update a given vertex +only one time. This means the toupdate set cannot contain multiple records +for the same vertex which requires the comparison with the existing value. \end{description} -This is not a 1-to-1 pseudocode for the implementation since we don't compare the `toupdate` table records one by one but calculate the overall minimum. In addition, the comparison with `cur` values take place earlier to reduce the number of tuples in the `toupdate` table. +This is not a 1-to-1 pseudocode for the implementation since we don't compare +the `toupdate` table records one by one but calculate the overall minimum. In +addition, the comparison with `cur` values take place earlier to reduce the +number of tuples in the `toupdate` table. + +\subsection{Implementation Details} + +In this section, we discuss the MADlib implementation of the SSSP algorithm +in depth. + +\begin{algorithm}[SSSP$(V,E,start)$] \label{alg:sssp:high} +\begin{algorithmic}[1] + \Repeat + \State Find Updates + \State Apply updates to the output table + \Until {There are no updates} +\end{algorithmic} +\end{algorithm} + +The implementation consists of two SQL blocks that are called sequentially +inside a loop. We will follow the example graph at Figure~\ref{sssp:example} +with the starting point as $v_0$. The very first update on the output table is +the source vertex. Its weight is $0$ and its parent is itself ($v_0$). After +this initialization step, the loop starts with Find Updates (the individual +updates will be represented with <dest,value,parent> format). Looking at the +example, it is clear that the updates should be <1,1,0>, <2,1,0> and <4,10,0>. +We will assume this iteration is already completed and look how the next +iteration of the algorithm works to explain the implementation details. + +\begin{algorithm}[Find Updates$(E,old\_update,out\_table)$] +\label{alg:sssp:findu} +\begin{lstlisting} +INSERT INTO new_update + SELECT DISTINCT ON (y.id) y.id AS id, + y.val AS val, + y.parent AS parent + FROM out_table INNER JOIN ( + SELECT edge_table.dest AS id, x.val AS val, old_update.id AS parent + FROM old_update + INNER JOIN edge_table + ON (edge_table.src = old_update.id) + INNER JOIN ( + SELECT edge_table.dest AS id, + min(old_update.val + edge_table.weight) AS val + FROM old_update INNER JOIN + edge_table AS edge_table ON + (edge_table.src=old_update.id) + GROUP BY edge_table.dest + ) x + ON (edge_table.dest = x.id) + WHERE ABS(old_update.val + edge_table.weight - x.val) < EPSILON + ) AS y ON (y.id = out_table.vertex_id) + WHERE y.val<out_table.weight +\end{lstlisting} +\end{algorithm} + +The Find Updates query is constructed in 4 levels of subqueries: \emph{find +values, find parents, eliminate duplicates and ensure improvement}. + +\begin{itemize} + +\item We begin our analysis at the innermost subquery, emph{find values} +(lines 11-16). This subquery takes a set of vertices (in the table +$old_update$) and finds the reachable vertices. In case a vertex is reachable +by multiple vertices, only the path that has the minimum cost is considered +(hence the name find values). There are two important points to note: + \begin{itemize} + \item The input vertices need the value of their path as well. + \begin{itemize} + \item In our example, both $v_1$ and $v_2$ can reach $v_3$. We would + have to use $v_2$ -> $v_3$ edge since that gives the lowest possible + path value. + \end{itemize} + \item The subquery is aggregating the rows using the $min$ operator for + each destination vertex and unable to return the source vertex at the + same time to use as the parent value. + \begin{itemize} + \item We know the value of $v_3$ should be $2$ but we cannot know + its parent ($v_2$) at the same time. + \end{itemize} + \end{itemize} + +\item The \emph{find parents} subquery is designed to solve the +aforementioned limitation. We combine the result of \emph{find values} with +$edge$ and $old\_update$ tables (lines 7-10) and get the rows that has the +same minimum value. + \begin{itemize} + \item Note that, we would have to tackle the problem of tie-breaking. + \begin{itemize} + \item Vertex $v_5$ has two paths leading into: <5,2,1> and <5,2,2>. + The inner subquery will return <5,2> and it will match both of these + edges. + \end{itemize} + \item It is redundant to keep both of them in the update list as that + would require updating the same vertex multiple times in a given + iteration. + \end{itemize} + +\item At this level, we employ the \emph{eliminate duplicates} subquery. By +using the $DISTINCT$ clause at line 2, we allow the underlying system to +accept only a single one of them. + +\item Finally, we introduce the \emph{ensure improvement} subquery to make +sure these updates are actually leading us to shortest paths. Line 21 ensures +that the values stored in the $out\_table$ does not increase and the solution +does not regress throughout the iterations. +\end{itemize} + +Applying updates is straightforward as the values and the associated parent +values are replaced using the $new\_update$ table. After this operation is +completed the $new\_update$ table becomes $old\_update$ for the next iteration +of the algorithm. + +Please note that, for ideal performance, \emph{vertex} and \emph{edge} tables +should be distributed on \emph{vertex id} and \emph{source id} respectively. http://git-wip-us.apache.org/repos/asf/incubator-madlib/blob/01586c0d/src/ports/postgres/modules/graph/graph_utils.py_in ---------------------------------------------------------------------- diff --git a/src/ports/postgres/modules/graph/graph_utils.py_in b/src/ports/postgres/modules/graph/graph_utils.py_in new file mode 100644 index 0000000..fb43491 --- /dev/null +++ b/src/ports/postgres/modules/graph/graph_utils.py_in @@ -0,0 +1,107 @@ +# coding=utf-8 +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# Graph Methods + +# Please refer to the graph.sql_in file for the documentation + +""" +@file graph.py_in + +@namespace graph +""" + +import plpy +from utilities.control import MinWarning +from utilities.utilities import _assert +from utilities.utilities import extract_keyvalue_params +from utilities.utilities import unique_string +from utilities.validate_args import get_cols +from utilities.validate_args import unquote_ident +from utilities.validate_args import table_exists +from utilities.validate_args import columns_exist_in_table +from utilities.validate_args import table_is_empty + + +def validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params, + out_table, func_name, **kwargs): + """ + Validates graph tables (vertex and edge) as well as the output table. + """ + _assert(out_table and out_table.strip().lower() not in ('null', ''), + "Graph {func_name}: Invalid output table name!".format(**locals())) + _assert(not table_exists(out_table), + "Graph {func_name}: Output table already exists!".format(**locals())) + + _assert(vertex_table and vertex_table.strip().lower() not in ('null', ''), + "Graph {func_name}: Invalid vertex table name!".format(**locals())) + _assert(table_exists(vertex_table), + "Graph {func_name}: Vertex table ({vertex_table}) is missing!".format( + **locals())) + _assert(not table_is_empty(vertex_table), + "Graph {func_name}: Vertex table ({vertex_table}) is empty!".format( + **locals())) + + _assert(edge_table and edge_table.strip().lower() not in ('null', ''), + "Graph {func_name}: Invalid edge table name!".format(**locals())) + _assert(table_exists(edge_table), + "Graph {func_name}: Edge table ({edge_table}) is missing!".format( + **locals())) + _assert(not table_is_empty(edge_table), + "Graph {func_name}: Edge table ({edge_table}) is empty!".format( + **locals())) + + existing_cols = set(unquote_ident(i) for i in get_cols(vertex_table)) + _assert(vertex_id in existing_cols, + """Graph {func_name}: The vertex column {vertex_id} is not present in + vertex table ({vertex_table}) """.format(**locals())) + _assert(columns_exist_in_table(edge_table, edge_params.values()), + """Graph {func_name}: Not all columns from {cols} present in edge + table ({edge_table})""".format(cols=edge_params.values(), **locals())) + + return None + +def get_graph_usage(schema_madlib, func_name, other_text): + + usage = """ +---------------------------------------------------------------------------- + USAGE +---------------------------------------------------------------------------- + SELECT {schema_madlib}.{func_name}( + vertex_table TEXT, -- Name of the table that contains the vertex data. + vertex_id TEXT, -- Name of the column containing the vertex ids. + edge_table TEXT, -- Name of the table that contains the edge data. + edge_args TEXT{comma} -- A comma-delimited string containing multiple + -- named arguments of the form "name=value". + {other_text} +); + +The following parameters are supported for edge table arguments ('edge_args' + above): + +src (default = 'src') : Name of the column containing the source + vertex ids in the edge table. +dest (default = 'dest') : Name of the column containing the destination + vertex ids in the edge table. +weight (default = 'weight') : Name of the column containing the weight of + edges in the edge table. +""".format(schema_madlib=schema_madlib, func_name=func_name, + other_text=other_text, comma = ',' if other_text is not None else ' ') + + return usage http://git-wip-us.apache.org/repos/asf/incubator-madlib/blob/01586c0d/src/ports/postgres/modules/graph/graph_utils.sql_in ---------------------------------------------------------------------- diff --git a/src/ports/postgres/modules/graph/graph_utils.sql_in b/src/ports/postgres/modules/graph/graph_utils.sql_in new file mode 100644 index 0000000..e69de29 http://git-wip-us.apache.org/repos/asf/incubator-madlib/blob/01586c0d/src/ports/postgres/modules/graph/sssp.py_in ---------------------------------------------------------------------- diff --git a/src/ports/postgres/modules/graph/sssp.py_in b/src/ports/postgres/modules/graph/sssp.py_in index 558ec3d..4d27761 100644 --- a/src/ports/postgres/modules/graph/sssp.py_in +++ b/src/ports/postgres/modules/graph/sssp.py_in @@ -28,6 +28,7 @@ """ import plpy +from graph_utils import * from utilities.control import MinWarning from utilities.utilities import _assert from utilities.utilities import extract_keyvalue_params @@ -84,7 +85,7 @@ def graph_sssp(schema_madlib, vertex_table, vertex_id, edge_table, local_distribution = m4_ifdef(<!__POSTGRESQL__!>, <!''!>, <!"DISTRIBUTED BY (id)"!>) - validate_graph_coding(vertex_table, vertex_id, edge_table, + validate_sssp(vertex_table, vertex_id, edge_table, edge_params, source_vertex, out_table) plpy.execute(" DROP TABLE IF EXISTS {0},{1},{2}".format( @@ -284,35 +285,11 @@ def graph_sssp_get_path(schema_madlib, sssp_table, dest_vertex, **kwargs): return None -def validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params, +def validate_sssp(vertex_table, vertex_id, edge_table, edge_params, source_vertex, out_table, **kwargs): - _assert(out_table and out_table.strip().lower() not in ('null', ''), - "Graph SSSP: Invalid output table name!") - _assert(not table_exists(out_table), - "Graph SSSP: Output table already exists!") - - _assert(vertex_table and vertex_table.strip().lower() not in ('null', ''), - "Graph SSSP: Invalid vertex table name!") - _assert(table_exists(vertex_table), - "Graph SSSP: Vertex table ({0}) is missing!".format(vertex_table)) - _assert(not table_is_empty(vertex_table), - "Graph SSSP: Vertex table ({0}) is empty!".format(vertex_table)) - - _assert(edge_table and edge_table.strip().lower() not in ('null', ''), - "Graph SSSP: Invalid edge table name!") - _assert(table_exists(edge_table), - "Graph SSSP: Edge table ({0}) is missing!".format(edge_table)) - _assert(not table_is_empty(edge_table), - "Graph SSSP: Edge table ({0}) is empty!".format(edge_table)) - - existing_cols = set(unquote_ident(i) for i in get_cols(vertex_table)) - _assert(vertex_id in existing_cols, - """Graph SSSP: The vertex column {vertex_id} is not present in vertex - table ({vertex_table}) """.format(**locals())) - _assert(columns_exist_in_table(edge_table, edge_params.values()), - "Graph SSSP: Not all columns from {0} present in edge table ({1})". - format(edge_params.values(), edge_table)) + validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params, + out_table,'SSSP') _assert(isinstance(source_vertex,int), """Graph SSSP: Source vertex {source_vertex} has to be an integer """. @@ -377,28 +354,7 @@ For more details on function usage: """ elif message in ['usage', 'help', '?']: help_string = """ ----------------------------------------------------------------------------- - USAGE ----------------------------------------------------------------------------- - SELECT {schema_madlib}.graph_sssp( - vertex_table TEXT, -- Name of the table that contains the vertex data. - vertex_id TEXT, -- Name of the column containing the vertex ids. - edge_table TEXT, -- Name of the table that contains the edge data. - edge_args TEXT, -- A comma-delimited string containing multiple - -- named arguments of the form "name=value". - source_vertex INT, -- The source vertex id for the algorithm to start. - out_table TEXT -- Name of the table to store the result of SSSP. -); - -The following parameters are supported for edge table arguments ('edge_args' - above): - -src (default = 'src') : Name of the column containing the source - vertex ids in the edge table. -dest (default = 'dest') : Name of the column containing the destination - vertex ids in the edge table. -weight (default = 'weight') : Name of the column containing the weight of - edges in the edge table. +{graph_usage} To retrieve the path for a specific vertex: @@ -428,5 +384,9 @@ shortest path from the initial source vertex to the desired destination vertex. else: help_string = "No such option. Use {schema_madlib}.graph_sssp()" - return help_string.format(schema_madlib=schema_madlib) + return help_string.format(schema_madlib=schema_madlib, + graph_usage=get_graph_usage(schema_madlib, 'graph_sssp', + """source_vertex INT, -- The source vertex id for the algorithm to start. + out_table TEXT -- Name of the table to store the result of SSSP.""")) # --------------------------------------------------------------------- + http://git-wip-us.apache.org/repos/asf/incubator-madlib/blob/01586c0d/src/ports/postgres/modules/graph/sssp.sql_in ---------------------------------------------------------------------- diff --git a/src/ports/postgres/modules/graph/sssp.sql_in b/src/ports/postgres/modules/graph/sssp.sql_in index 7534a75..7f89823 100644 --- a/src/ports/postgres/modules/graph/sssp.sql_in +++ b/src/ports/postgres/modules/graph/sssp.sql_in @@ -286,4 +286,3 @@ RETURNS VARCHAR AS $$ $$ LANGUAGE sql IMMUTABLE m4_ifdef(`\_\_HAS_FUNCTION_PROPERTIES\_\_', `CONTAINS SQL', `'); -------------------------------------------------------------------------------- -