[
https://issues.apache.org/jira/browse/TINKERPOP-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124157#comment-15124157
]
Marko A. Rodriguez commented on TINKERPOP-1108:
-----------------------------------------------
The scary thing about this is that we have Spark accumulators emitted in the
{{viewOutgoingMessageRDD}} and thus, we may have a problem with generating two
RDDs as we might duplicate the accumulator data. However, we may just want to
put the accumulator data into {{viewRDD}} and on the {{join()}}, broadcast the
variables then! ... needs some thinking.
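
One plausible reading of the duplication concern, sketched in plain Java with no Spark dependency: if an accumulator is updated inside a lazy transformation and two outputs are derived from that transformation without materializing it first, the side effect runs once per output. Everything below is a hypothetical stand-in: {{AccumulatorDuplicationSketch}}, the {{AtomicInteger}} playing the accumulator, and the {{Supplier}} playing the lazy parent are illustrations, not the {{SparkGraphComputer}} code or the Spark API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class AccumulatorDuplicationSketch {

    // Hypothetical stand-in for a Spark accumulator.
    static final AtomicInteger accumulator = new AtomicInteger();

    // Hypothetical stand-in for a lazy parent computation: each call to get()
    // builds a fresh pipeline, so the map function and its side effect run again.
    static Supplier<Stream<String>> parent(List<String> vertices) {
        return () -> vertices.stream().map(v -> {
            accumulator.incrementAndGet(); // side effect inside a transformation
            return v + ":processed";
        });
    }

    // Baseline: one output derived from the parent, one update per vertex.
    static int deriveOneOutput(List<String> vertices) {
        accumulator.set(0);
        parent(vertices).get().collect(Collectors.toList());
        return accumulator.get();
    }

    // Two outputs derived from the same lazy parent: the parent is evaluated
    // twice, so every vertex updates the accumulator twice.
    static int deriveTwoOutputs(List<String> vertices) {
        accumulator.set(0);
        Supplier<Stream<String>> lazyParent = parent(vertices);
        List<String> view = lazyParent.get().collect(Collectors.toList());
        List<String> messages = lazyParent.get().collect(Collectors.toList());
        return accumulator.get();
    }

    public static void main(String[] args) {
        List<String> vertices = Arrays.asList("a", "b", "c");
        System.out.println("one output:  " + deriveOneOutput(vertices));
        System.out.println("two outputs: " + deriveTwoOutputs(vertices));
    }
}
```

With three vertices the baseline records three updates and the two-output variant records six, which is the kind of double counting the comment worries about.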
> Produce two RDDs from executeVertexProgram in SparkGraphComputer
> ----------------------------------------------------------------
>
> Key: TINKERPOP-1108
> URL: https://issues.apache.org/jira/browse/TINKERPOP-1108
> Project: TinkerPop
> Issue Type: Improvement
> Components: hadoop
> Affects Versions: 3.1.1-incubating
> Reporter: Marko A. Rodriguez
>
> I have done a lot to optimize our implementation of {{SparkGraphComputer}}. I
> now know the reason for every shuffle, input, spill, etc. that happens during
> a job. There is one more optimization that MAY or MAY NOT work, but it is
> worth trying because, if it does what I think it will do, we may get a 2x
> improvement.
> We currently do:
> {code}
> graphRDD -> viewOutgoingMessagesRDD
> {code}
> We should do:
> {code}
> graphRDD --> viewRDD
>          --> outgoingMessageRDD
> {code}
> The {{viewRDD}} will have the same partitioner as the {{graphRDD}} and thus,
> a local join is all that is required. The {{outgoingMessageRDD}} will not be
> partitioned, so its join will cause a shuffle. Thus, after this block, we do:
> {code}
> graphRDD.join(viewRDD)
>     .mapValues(...attach the view...)
>     .join(outgoingMessageRDD)
> {code}
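
The partitioning argument above can be sketched in plain Java with no Spark dependency. The sketch assumes a hash partitioner that assigns a key to {{floorMod(hashCode, numPartitions)}}; the class {{CoPartitionSketch}}, its method names, and the sample keys are all hypothetical. It counts how many records sit in a partition other than the one the partitioner assigns, that is, how many records a join would have to move.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CoPartitionSketch {

    // Assumed partitioning rule: non-negative hash modulo the partition count.
    static int partitionOf(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Places every key in the partition chosen by the rule above, as a data
    // set sharing the graph's partitioner would be laid out.
    static List<List<String>> hashPartition(List<String> keys, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) partitions.add(new ArrayList<>());
        for (String key : keys) partitions.get(partitionOf(key, numPartitions)).add(key);
        return partitions;
    }

    // Places keys round-robin in list order, ignoring the partitioner, as a
    // data set with no partitioner might be laid out.
    static List<List<String>> roundRobin(List<String> keys, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) partitions.add(new ArrayList<>());
        for (int i = 0; i < keys.size(); i++) partitions.get(i % numPartitions).add(keys.get(i));
        return partitions;
    }

    // Counts keys that are not already in the partition the rule assigns them,
    // i.e. the records that must move before a key-wise join can run locally.
    static int recordsToMove(List<List<String>> partitions) {
        int moved = 0;
        for (int i = 0; i < partitions.size(); i++) {
            for (String key : partitions.get(i)) {
                if (partitionOf(key, partitions.size()) != i) moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        List<String> vertexIds = Arrays.asList("v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8");
        System.out.println("co-partitioned, records to move: "
                + recordsToMove(hashPartition(vertexIds, 4)));
        System.out.println("unpartitioned, records to move:  "
                + recordsToMove(roundRobin(vertexIds, 4)));
    }
}
```

A data set laid out by the same rule as the graph never needs to move a record, which corresponds to the local join with {{viewRDD}}; a data set laid out without regard to that rule generally does, which corresponds to the shuffle on the {{outgoingMessageRDD}} join.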