[ https://issues.apache.org/jira/browse/TINKERPOP-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yang Xia closed TINKERPOP-1108.
-------------------------------
    Resolution: Won't Do

Closing given the [discussion|https://lists.apache.org/thread/om2m0phg25s83529p9w0gldmcxz7578h] - it can be reopened if there is an expectation of active work on this item.

> Produce two RDDs from executeVertexProgram in SparkGraphComputer
> ----------------------------------------------------------------
>
>                 Key: TINKERPOP-1108
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-1108
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: hadoop
>    Affects Versions: 3.1.1-incubating
>            Reporter: Marko A. Rodriguez
>            Priority: Major
>
> I have done a lot to optimize our implementation of {{SparkGraphComputer}}. I now know the reason for every shuffle, input, spill, etc. that happens during a job. There is one more optimization that MAY or MAY NOT work, but it is worth trying: if it does what I think it will, we may get a (perhaps) 2x improvement.
> We currently do:
> {code}
> graphRDD -> viewOutgoingMessagesRDD
> {code}
> We should do:
> {code}
> graphRDD -->
>   viewRDD
>   outgoingMessageRDD
> {code}
> The {{viewRDD}} will have the same partitioner as the {{graphRDD}}, and thus a local join is all that is required. The {{outgoingMessageRDD}} will not be partitioned, so its join will cause a shuffle. Thus, after this block, we do:
> {code}
> graphRDD.join(viewRDD).mapValues(...attach the view...).join(outgoingMessageRDD)
> {code}
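
For readers less familiar with Spark's partitioning semantics, here is a minimal sketch of the join pattern the ticket proposes. It is illustrative Java against the plain Spark RDD API, not TinkerPop code: Long ids and String payloads stand in for the real vertex/view/message classes, and all names other than graphRDD, viewRDD, and outgoingMessageRDD are hypothetical.

{code}
import java.util.Arrays;

import org.apache.spark.HashPartitioner;
import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

// Hypothetical sketch class, not part of TinkerPop.
public final class TwoRddJoinSketch {

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "two-rdd-join-sketch");
        Partitioner partitioner = new HashPartitioner(4);

        // graphRDD: vertexId -> vertex payload, explicitly partitioned and cached.
        JavaPairRDD<Long, String> graphRDD = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>(1L, "vertex-1"),
                new Tuple2<>(2L, "vertex-2")))
            .partitionBy(partitioner)
            .cache();

        // viewRDD: one view entry per vertex. mapValues() preserves the parent's
        // partitioner, so viewRDD stays co-partitioned with graphRDD for free.
        JavaPairRDD<Long, String> viewRDD =
            graphRDD.mapValues(v -> v + "-view");

        // outgoingMessageRDD: keyed by the RECEIVING vertex, so it cannot share
        // graphRDD's layout. (Built from literals here to keep the sketch small.)
        JavaPairRDD<Long, String> outgoingMessageRDD = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>(2L, "message-from-1"),
                new Tuple2<>(1L, "message-from-2")));

        // Join #1 is narrow (identical partitioners, no shuffle); mapValues()
        // again preserves partitioning; join #2 shuffles only the message side.
        JavaPairRDD<Long, Tuple2<String, String>> nextGraphRDD =
            graphRDD.join(viewRDD)
                .mapValues(pair -> pair._1() + "+" + pair._2())   // "attach the view"
                .join(outgoingMessageRDD);

        nextGraphRDD.collect().forEach(System.out::println);
        sc.stop();
    }
}
{code}

The benefit of the split is visible in the Spark UI: the first join and the mapValues() produce no shuffle because the partitioners match, while the second join moves only the outgoingMessageRDD side over the network. Whether that nets the hoped-for 2x depends on the caching and iteration details the ticket flags as MAY or MAY NOT work.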