[jira] [Created] (TINKERPOP-1108) Produce two RDDs from executeVertexProgram in SparkGraphComputer

Marko A. Rodriguez (JIRA) Fri, 29 Jan 2016 09:54:12 -0800

Marko A. Rodriguez created TINKERPOP-1108:
---------------------------------------------


             Summary: Produce two RDDs from executeVertexProgram in SparkGraphComputer
                 Key: TINKERPOP-1108
                 URL: https://issues.apache.org/jira/browse/TINKERPOP-1108
             Project: TinkerPop
          Issue Type: Improvement
          Components: hadoop
    Affects Versions: 3.1.1-incubating
            Reporter: Marko A. Rodriguez


I have done a lot to optimize our implementation of {{SparkGraphComputer}}, and I 
now know the reason for every shuffle, input read, and spill that 
happens during a job. There is one more optimization that MAY or MAY NOT work, 
but it is worth trying: if it does what I think it will do, we may get 
perhaps a 2x improvement.

We currently do:

{code}
graphRDD -> viewOutgoingMessagesRDD
{code}

We should do:

{code}
graphRDD -->
   viewRDD
   outgoingMessageRDD
{code}
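
As a rough illustration of the split, here is a toy model in plain Python. It is not Spark and not the actual {{SparkGraphComputer}} code: the record shape and the helper name {{split_view_and_messages}} are assumptions made up for this sketch. The point it shows is that the view stays keyed by the vertex that produced it, while each outgoing message is keyed by the vertex that should receive it.

```python
from typing import Dict, List, Tuple

# Hypothetical record shape for the combined output:
#   vertex id -> (view of that vertex, [(target vertex id, message), ...])
Combined = Dict[int, Tuple[dict, List[Tuple[int, str]]]]


def split_view_and_messages(
    combined: Combined,
) -> Tuple[Dict[int, dict], List[Tuple[int, str]]]:
    """Split one combined result into a view map and a message list."""
    # The view keeps the key of the vertex it belongs to.
    view = {vertex_id: v for vertex_id, (v, _) in combined.items()}
    # Each message is re-keyed by its target vertex.
    messages = [
        (target, message)
        for _, (_, outgoing) in combined.items()
        for target, message in outgoing
    ]
    return view, messages


combined: Combined = {
    1: ({"rank": 0.5}, [(2, "a"), (3, "b")]),
    2: ({"rank": 0.3}, [(1, "c")]),
    3: ({"rank": 0.2}, []),
}
view, messages = split_view_and_messages(combined)
```

In this model the keys of {{view}} are exactly the keys of the input, whereas the keys of {{messages}} are whatever vertices the messages are addressed to.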

The {{viewRDD}} will have the same partitioner as the {{graphRDD}} and thus a 
local join is all that is required. The {{outgoingMessageRDD}} will not be 
partitioned, so its join will cause a shuffle. Thus, after this block, we do:

{code}
graphRDD.join(viewRDD).mapValues(...attach the view...).join(outgoingMessageRDD)
{code}
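
The reasoning about which join needs a shuffle can be sketched with a toy partitioning model in plain Python. This is only an illustration under simplifying assumptions, not Spark's join machinery: {{partition_of}}, {{records_moved}}, the modulo partitioner, and the sample vertex ids are all hypothetical. It counts how many records sit on a partition other than the one that owns their key.

```python
from typing import List, Tuple

NUM_PARTITIONS = 2  # assumed partition count for this toy model


def partition_of(vertex_id: int) -> int:
    """Toy hash partitioner: the partition that owns a vertex id."""
    return vertex_id % NUM_PARTITIONS


def records_moved(records: List[Tuple[int, int]]) -> int:
    """Count records whose current partition differs from their key's partition.

    Each record is (partition currently holding it, key).
    """
    return sum(1 for current, key in records if partition_of(key) != current)


# View entries are produced on the partition that owns their vertex,
# so they are already where the graph data for that vertex lives.
view_records = [(partition_of(v), v) for v in (1, 2, 3, 4)]

# Messages are produced on the sender's partition but keyed by the receiver.
edges = [(1, 2), (2, 4), (3, 4), (4, 1)]  # (sender, receiver)
message_records = [(partition_of(sender), receiver) for sender, receiver in edges]

view_moves = records_moved(view_records)
message_moves = records_moved(message_records)
```

In this model no view record has to move, while any message whose receiver lives on a different partition than its sender does. Real Spark behaviour depends on the partitioners attached to each RDD, so treat this as intuition rather than a measurement.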





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
