Marko A. Rodriguez created TINKERPOP-1309:
---------------------------------------------
Summary: Memory output in HadoopGraph is too strongly tied to
MapReduce and should be generalized.
Key: TINKERPOP-1309
URL: https://issues.apache.org/jira/browse/TINKERPOP-1309
Project: TinkerPop
Issue Type: Improvement
Components: hadoop, process
Affects Versions: 3.2.0-incubating
Reporter: Marko A. Rodriguez
The {{Memory}} object is not being written to disk by {{SparkGraphComputer}}
unless it's being updated within a {{MapReduce}} job. That is no bueno. We
should really have the computed {{Memory}} be written out as such:
{code}
hdfs.ls("output")
==>~g
==>~memory
{code}
Moreover, {{~g}} should be {{~graph}} :) but that is a different story...
Then:
{code}
hdfs.ls("output/~memory")
==>gremlin.traversalVertexProgram.haltedTraversals
==>a
==>x
{code}
Note that every {{GraphComputer}} job yields a {{ComputerResult}}, which is
basically a {{Pair<Graph,Memory>}}. The {{Graph}} reference denotes the
adjacency list of vertices; if there are HALTED_TRAVERSERS, they will be
stored on those vertices. This is a distributed representation. Next, the
{{Memory}} reference denotes data that is no longer "attached to the graph"
-- like maps, counts, sums, etc. In general, reduction barriers. This data is
not tied to any one vertex anymore and thus exists at the "master traversal"
via {{Memory}}. Thus, "graph is distributed/workers" and "memory is
local/master."
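For concreteness, here is a minimal sketch of how a user touches both halves
of that pair (assuming {{graph}} is a {{HadoopGraph}} and {{program}} is some
{{VertexProgram}} -- both are placeholders here):
{code}
import org.apache.tinkerpop.gremlin.process.computer.ComputerResult;
import org.apache.tinkerpop.gremlin.process.computer.Memory;
import org.apache.tinkerpop.gremlin.process.computer.VertexProgram;
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.structure.Graph;

public static void inspectResult(final Graph graph, final VertexProgram program) throws Exception {
    final ComputerResult result = graph.compute(SparkGraphComputer.class)
            .program(program)
            .submit()
            .get();
    final Graph resultGraph = result.graph(); // distributed: the adjacency list (plus any HALTED_TRAVERSERS)
    final Memory memory = result.memory();    // local/master: reduction barriers, counts, sums, etc.
    // each memory key is exactly what should appear under output/~memory
    for (final String key : memory.keys()) {
        System.out.println(key + " -> " + memory.get(key));
    }
}
{code}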
We need to make sure that the {{Memory}} data is serialized to disk
appropriately for {{HadoopGraph}}-based implementations...
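One possible shape for that is sketched below: one file per memory key under
{{output/~memory}}, written with plain JDK serialization. The {{writeMemory}}
helper and the per-key file layout are assumptions of mine for illustration,
not the proposed implementation:
{code}
import java.io.ObjectOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.tinkerpop.gremlin.process.computer.Memory;

// sketch: persist each memory key as its own file under output/~memory
public static void writeMemory(final Memory memory, final Configuration conf,
                               final String outputLocation) throws Exception {
    final FileSystem fs = FileSystem.get(conf);
    for (final String key : memory.keys()) {
        final Path path = new Path(outputLocation + "/~memory/" + key);
        // JDK serialization is just for illustration; a real implementation
        // would presumably go through the configured Hadoop Writable/serializer machinery
        try (ObjectOutputStream out = new ObjectOutputStream(fs.create(path))) {
            out.writeObject(memory.get(key));
        }
    }
}
{code}
Whatever the mechanism, routing it through the standard output machinery (rather
than raw JDK serialization) would keep the format consistent across
{{HadoopGraph}}-based implementations.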