[
https://issues.apache.org/jira/browse/TINKERPOP-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yang Xia closed TINKERPOP-1108.
-------------------------------
Resolution: Won't Do
Closing per the
[discussion|https://lists.apache.org/thread/om2m0phg25s83529p9w0gldmcxz7578h];
it can be reopened if there is an expectation of active work on this item.
> Produce two RDDs from executeVertexProgram in SparkGraphComputer
> ----------------------------------------------------------------
>
> Key: TINKERPOP-1108
> URL: https://issues.apache.org/jira/browse/TINKERPOP-1108
> Project: TinkerPop
> Issue Type: Improvement
> Components: hadoop
> Affects Versions: 3.1.1-incubating
> Reporter: Marko A. Rodriguez
> Priority: Major
>
> I have done a lot to optimize our implementation of
> {{SparkGraphComputer}}. I now know the reason for every shuffle, input,
> spill, etc. that happens during a job. There is one more optimization that
> MAY or MAY NOT work, but it is worth trying: if it does what I think it
> will do, we may see roughly a 2x improvement.
> We currently do:
> {code}
> graphRDD -> viewOutgoingMessagesRDD
> {code}
> We should do:
> {code}
> graphRDD --> viewRDD
> graphRDD --> outgoingMessageRDD
> {code}
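> A minimal Scala sketch of how the split might look, assuming a
> hypothetical {{ProgramResult}} that holds each vertex's view and its
> outgoing messages (the names and types are illustrative, not the actual
> TinkerPop API):
> {code}
> import org.apache.spark.rdd.RDD
>
> // Hypothetical per-vertex output of one vertex-program iteration.
> case class ProgramResult(view: List[Any], messages: List[(Long, Any)])
>
> def split(results: RDD[(Long, ProgramResult)])
>     : (RDD[(Long, List[Any])], RDD[(Long, Any)]) = {
>   // mapValues keeps the vertex id as the key, so viewRDD retains
>   // the parent RDD's partitioner.
>   val viewRDD = results.mapValues(_.view)
>   // Re-keying by message recipient discards the partitioner, so any
>   // join against outgoingMessageRDD must shuffle.
>   val outgoingMessageRDD = results.flatMap { case (_, r) => r.messages }
>   (viewRDD, outgoingMessageRDD)
> }
> {code}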
> The {{viewRDD}} will have the same partitioner as the {{graphRDD}}, and
> thus a local join is all that is required. The {{outgoingMessageRDD}} will
> not be co-partitioned, so its join will cause a shuffle. Thus, after this
> block, we do:
> {code}
> graphRDD.join(viewRDD).mapValues(...attach the view...).join(outgoingMessageRDD)
> {code}
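> Under the same assumptions, a sketch of that join chain: the first join
> is partition-local because {{viewRDD}} shares {{graphRDD}}'s partitioner,
> and only the second join shuffles:
> {code}
> // Continuing the sketch above, with graphRDD: RDD[(Long, Any)] and
> // attachView a hypothetical placeholder for merging a view into a vertex.
> def attachView(vertex: Any, view: List[Any]): Any = vertex
>
> val updatedRDD =
>   graphRDD
>     .join(viewRDD)            // co-partitioned with graphRDD: no shuffle
>     .mapValues { case (vertex, view) => attachView(vertex, view) }
>     .join(outgoingMessageRDD) // no shared partitioner: one shuffle
> {code}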