[
https://issues.apache.org/jira/browse/TINKERPOP-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Marko A. Rodriguez updated TINKERPOP-1309:
------------------------------------------
Description:
The {{Memory}} object is not being written to disk in {{SparkGraphComputer}}
unless it is being updated within a {{MapReduce}} job. That is no bueno. We
should really have the computed {{Memory}} be written as such:
{code}
hdfs.ls("output")
==>~g
==>~memory
{code}
Moreover, {{~g}} should be {{~graph}} :) but that is a different story...
Then:
{code}
hdfs.ls("output/~memory")
==>gremlin.traversalVertexProgram.haltedTraversals
==>a
==>x
{code}
Note that every {{GraphComputer}} job yields a {{ComputerResult}} which is
basically {{Pair<Graph,Memory>}}. The {{Graph}} reference denotes the adjacency
list of vertices and on all those vertices, if there are HALTED_TRAVERSERS,
they will be on those vertices. This is a distributed representation. Next, the
{{Memory}} reference denotes data that is no longer "attached to the graph" --
like maps, counts, sums, etc. In general, reduction barriers. This data is not
tied to any one vertex anymore and thus exists at the "master traversal" via
{{Memory}}. Thus, "graph is distributed/workers" and "memory is local/master."
We need to make sure that the {{Memory}} data is serialized to disk
appropriately for {{HadoopGraph}}-based implementations...
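A minimal sketch of what the proposed layout could look like on disk, in plain Java. Note the assumptions: a local directory stands in for HDFS, a {{Map<String,Object>}} stands in for the real {{Memory}} interface, and {{MemoryWriter}} is a hypothetical helper, not TinkerPop API. One file per memory key under {{output/~memory}}, so the key names show up in an {{ls}} just like the listing above:

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch: persist each Memory key under <output>/~memory/<key>,
// mirroring the proposed on-disk layout. A local directory stands in for HDFS
// and a Map stands in for the real Memory interface (both assumptions).
public class MemoryWriter {

    public static void write(File outputDir, Map<String, Object> memory) throws IOException {
        File memoryDir = new File(outputDir, "~memory");
        if (!memoryDir.mkdirs() && !memoryDir.isDirectory())
            throw new IOException("could not create " + memoryDir);
        for (Map.Entry<String, Object> entry : memory.entrySet()) {
            // one file per memory key; the value is Java-serialized into it
            try (ObjectOutputStream out = new ObjectOutputStream(
                    new FileOutputStream(new File(memoryDir, entry.getKey())))) {
                out.writeObject(entry.getValue());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // stand-in for the reduction-barrier data held by the master traversal
        Map<String, Object> memory = new LinkedHashMap<>();
        memory.put("a", 7L);
        memory.put("x", "done");
        File output = new File("output");
        write(output, memory);
        // listing output/~memory now yields one entry per memory key
        for (String key : new File(output, "~memory").list())
            System.out.println("==>" + key);
    }
}
```

In the real implementation the writer would of course go through the Hadoop filesystem API and whatever serializer the graph provider is configured with, rather than {{java.io}}; the point is only that {{Memory}} gets its own sibling directory next to the vertex output, independent of any {{MapReduce}} job.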
> Memory output in HadoopGraph is too strongly tied to MapReduce and should be
> generalized.
> -----------------------------------------------------------------------------------------
>
> Key: TINKERPOP-1309
> URL: https://issues.apache.org/jira/browse/TINKERPOP-1309
> Project: TinkerPop
> Issue Type: Improvement
> Components: hadoop, process
> Affects Versions: 3.2.0-incubating
> Reporter: Marko A. Rodriguez
> Labels: breaking
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)