[47/50] tinkerpop git commit: Merged vtslab recipe for connected components

spmallette Fri, 27 Jul 2018 06:20:49 -0700

Merged vtslab recipe for connected components


Project: http://git-wip-us.apache.org/repos/asf/tinkerpop/repo
Commit: http://git-wip-us.apache.org/repos/asf/tinkerpop/commit/ec2bf45e
Tree: http://git-wip-us.apache.org/repos/asf/tinkerpop/tree/ec2bf45e
Diff: http://git-wip-us.apache.org/repos/asf/tinkerpop/diff/ec2bf45e

Branch: refs/heads/TINKERPOP-1967
Commit: ec2bf45e95c475755378bb237abe6960bf21dd4b
Parents: e7a9b23
Author: HadoopMarc <vts...@xs4all.nl>
Authored: Mon May 21 14:03:54 2018 +0200
Committer: Stephen Mallette <sp...@genoprime.com>
Committed: Fri Jul 27 09:19:52 2018 -0400

----------------------------------------------------------------------
 docs/src/recipes/connected-components.asciidoc | 123 +++++++++++++-------
 1 file changed, 83 insertions(+), 40 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/tinkerpop/blob/ec2bf45e/docs/src/recipes/connected-components.asciidoc
----------------------------------------------------------------------
diff --git a/docs/src/recipes/connected-components.asciidoc 
b/docs/src/recipes/connected-components.asciidoc
index 70abdbd..edbeec5 100644
--- a/docs/src/recipes/connected-components.asciidoc
+++ b/docs/src/recipes/connected-components.asciidoc
@@ -14,17 +14,31 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either 
express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 ////
+
+// @author Daniel Kuppitz (anwer on gremlin user list)
+// @author Robert Dale (answer on gremlin user list)
+// @author Marc de Lignie
+
 [[connected-components]]
 == Connected Components
 
 Gremlin can be used to find 
link:https://en.wikipedia.org/wiki/Connected_component_(graph_theory)[connected 
components]
-in a graph. As of TinkerPop 3.4.0, the process has been simplified to the 
`connectedComponent()`-step which is
-described in more detail in the link:
-link:http://tinkerpop.apache.org/docs/x.y.z/reference/#connectedcomponent-step[Reference
 Documentation].
+in a graph. In a directed graph like in TinkerPop, components can be weakly or 
strongly connected. This recipe is
+restricted to finding 
link:https://en.wikipedia.org/wiki/Directed_graph#Directed_graph_connectivity[weakly
+connected components], in which the direction of edges is not taken into 
account.
+
+Depending on the size of the graph, three solution regimes can be 
discriminated:
+
+1. Small graphs that fit in the memory of a single machine
+
+2. Medium graphs backed by storage for which a linear scan is still feasible. 
This regime is left to third party
+TinkerPop implementations, since TinkerPop itself has no storage-backed 
reference implementations. The idea is that
+component membership is stored in the graph, rather than in memory.
 
-The `connectedComponent()`-step replaces the original recipe described below 
from earlier versions of TinkerPop,
-however the algorithm from that old recipe remains interesting for educational 
purposes and has thus not been removed.
-The original recipe considers the following graph which has three connected 
components:
+3. Large graphs requiring an OLAP approach to yield results in a reasonable 
time.
+
+
+These regimes are discussed separately using the following graph with three 
weakly connected components:
 
 image:connected-components.png[width=600]
 
@@ -41,46 +55,75 @@ g.addV().property(id, "A").as("a").
   addE("link").from("d").to("e").iterate()
 ----
 
-One way to detect the various subgraphs would be to do something like this:
+
+===== Small graphs
+
+Connected components in a small graph can be determined with both an OLTP 
traversal and the OLAP
+`connectedComponent()`-step. The `connectedComponent()`-step is available as 
of TinkerPop 3.4.0 and is
+described in more detail in the
+link:http://tinkerpop.apache.org/docs/x.y.z/reference/#connectedcomponent-step[Reference
 Documentation].
+
+A straightforward way to detect the various subgraphs with an OLTP traversal 
is to do this:
 
 [gremlin-groovy,existing]
 ----
-g.V().emit(cyclicPath().or().not(both())).repeat(both()).until(cyclicPath()).  
<1>
-  path().aggregate("p").                                                       
<2>
-  unfold().dedup().                                                            
<3>
-  map(__.as("v").select("p").unfold().                                         
<4>
-         filter(unfold().where(eq("v"))).
-         unfold().dedup().order().by(id).fold()).
-  dedup()                                                                      
<5>
+g.V().emit(cyclicPath().or().not(both())).                                    
<1>
+    repeat(__.where(without('a')).store('a').both()).until(cyclicPath()).     
<2>
+    group().by(path().unfold().limit(1)).                                     
<3>
+    by(path().unfold().dedup().fold()).                                       
<4>
+    select(values).unfold()                                                   
<5>
 ----
 
-<1> Iterate all vertices and repeatedly traverse over both incoming and 
outgoing edges (TinkerPop doesn't support
-unidirectional graphs directly so it must be simulated by ignoring the 
direction with `both`). Note the use of `emit`
-prior to `repeat` as this allows for return of a single length path.
-<2> Aggregate the `path()` of the emitted vertices to "p". It is within these 
paths that the list of connected
-components will be identified. Obviously the paths list are duplicative in the 
sense that they contains different
-paths traveled over the same vertices.
-<3> Unroll the elements in the path list with `unfold` and `dedup`.
-<4> Use the first vertex in each path to filter against the paths stored in 
"p". When a path is found that has the
-vertex in it, dedup the vertices in the path, order it by the identifier. Each 
path output from this `map` step
-represents a connected component.
-<5> The connected component list is duplicative given the nature of the paths 
in "p", but now that the vertices within
-the paths are ordered, a final `dedup` will make the list of connective 
components unique.
-
-NOTE: This is a nice example of where running smaller pieces of a large 
Gremlin statement make it easier to see what
-is happening at each step. Consider running this example one line at a time 
(or perhaps even in a step at a time) to
-see the output at each point.
-
-While the above approach returns results nicely, the traversal doesn't appear 
to work with OLAP. A less efficient
-approach, but one more suited for OLAP execution looks quite similar but does 
not use `dedup` as heavily (thus
-`GraphComputer` is forced to analyze far more paths):
+<1> The initial emit() step allows for output of isolated vertices, in 
addition to the discovery of
+components as described in (2).
+
+<2> The entire component to which the first returned vertex belongs, is 
visited. To allow for components of any
+structure, a repeat loop is applied that only stops for a particular branch of 
the component when it detects a cyclic
+path.  Collection `'a'` is used to keep track of visited vertices, for both 
subtraversals within a component
+and new traversals resulting from the `g.V()` linear scan.
+
+<3> While `'a'` nicely keeps track of vertices already visited, the actual 
components need to be extracted from the
+path information of surviving traversers. The `path().unfold().limit(1)` 
closure provides the starting vertex
+of surviving traversers, which can be used to group the components.
+
+<4> This clause collects the unique vertices from all paths with the same 
starting vertex, thus from the same
+weak component.
+
+<5> The values of the groupby map contain the lists of vertices making up the 
requested components.
+
+This algorithm completes in linear time with the number of vertices and edges, 
because a traversal is started for each
+vertex and each edge with its associated out-vertex is visited exactly once.
+
+
+==== Large graphs
+
+Large graphs require an OLAP solution with a custom VertexProgram that can be 
run using a graph implementation's
+GraphComputer, in particular `SparkGraphComputer` on a `HadoopGraph`. The OLAP 
solution also runs faster for most
+medium-sized graphs, that is when these graph have a 'natural' structure with 
a limited maximum path length.
+
+The TinkerPop library of vertex programs contains the 
`WeakComponentsVertexProgram` which can be run in the same
+way as the 
link:http://tinkerpop.apache.org/docs/x.y.z/reference/#peerpressurevertexprogram[PeerPressureVertexProgram]:
 
 [gremlin-groovy,existing]
 ----
-g.withComputer().V().emit(cyclicPath().or().not(both())).repeat(both()).until(cyclicPath()).
-  aggregate("p").by(path()).cap("p").unfold().limit(local, 1).
-  map(__.as("v").select("p").unfold().
-         filter(unfold().where(eq("v"))).
-         unfold().dedup().order().by(id).fold()
-  ).toSet()
+result = g.getGraph().compute().
+    program(WeakComponentsVertexProgram.build().maxIterations(100).create()).
+    mapReduce(ClusterPopulationMapReduce.build().create()).
+    mapReduce(ClusterCountMapReduce.build().create()).
+    submit().get()
+result.memory().clusterPopulation
+gResult = result.graph().traversal()
+gResult.V().valueMap(true)
 ----
+
+The vertex program has interconnected vertices exchange id's and store the 
lowest id until no vertex receives a
+lower id. This algorithm is commonly applied in
+link:https://en.wikipedia.org/wiki/Bulk_synchronous_parallel[bulk synchronous 
parallel] systems, e.g. in
+link:https://spark.apache.org/graphx[Apache Spark GraphX].
+
+==== Scalability
+
+ToDo:
+ - limits and run time regime 1
+ - test of friendster graph regime 3
+ - discuss: 
link:http://www.vldb.org/pvldb/vol7/p1821-yan.pdf[http://www.vldb.org/pvldb/vol7/p1821-yan.pdf]
\ No newline at end of file

[47/50] tinkerpop git commit: Merged vtslab recipe for connected components

Reply via email to