GitHub user okram opened a pull request:
https://github.com/apache/incubator-tinkerpop/pull/214
TINKERPOP-1131: TraversalVertexProgram traverser management is inefficient
memory-wise.
https://issues.apache.org/jira/browse/TINKERPOP-1131
This will go into TinkerPop 3.1.2. I will then upmerge it to master/
(3.2.0) and when https://github.com/apache/incubator-tinkerpop/pull/210 is
merged to master/ (3.2.0), we will have a lean-mean OLAP processing machine!
I realized that I was copying `TraverserSet`s instead of draining one into
the other. With draining, every traverser put into one set is removed from
the other. With copying, in the worst case, we could have up to 3 sets of
equivalent size all in memory at one vertex in OLAP. If you do any sort of
`outE()`-type traversal, that's a lot of data. Moreover, I reorganized the flow
of processing, as previously I was determining whether messages needed to be
sent right after receiving messages! A pointless waste of CPU cycles. Finally,
I created a "bulking model" whereby I try to fill a `Step` with as many
traversers as I can before I have to drain it for message passing. Previously,
it was one traverser at a time; again, this can lead to significant
inefficiencies.
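To make the copy-versus-drain distinction concrete, here is a minimal sketch. It is not the code from this pull request: a plain `java.util.Set` stands in for `TraverserSet`, and the class name `DrainSketch` and its methods are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration only: a plain Set stands in for TraverserSet.
final class DrainSketch {

    // Copying: the source keeps every element, so both sets stay fully populated.
    static <T> void copy(Set<T> source, Set<T> target) {
        target.addAll(source);
    }

    // Draining: each element is removed from the source right after it is
    // added to the target, so the two sets never both hold the full data.
    static <T> void drain(Set<T> source, Set<T> target) {
        Iterator<T> it = source.iterator();
        while (it.hasNext()) {
            target.add(it.next());
            it.remove();
        }
    }

    public static void main(String[] args) {
        Set<String> source = new LinkedHashSet<>(Arrays.asList("t1", "t2", "t3"));
        Set<String> target = new LinkedHashSet<>();
        drain(source, target);
        // After draining, the source is empty and the target holds all three elements.
        System.out.println(source.size() + " " + target.size());
    }
}
```

With `copy`, the source still holds all of its elements afterwards, which is the duplication described above; with `drain`, the combined number of elements across both sets stays constant throughout.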
Ran `mvn clean install` and the integration tests (Giraph is still running, but
it will work: if one `TraversalVertexProgram` test passes, then the meat is
right).
VOTE +1.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP-1131
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-tinkerpop/pull/214.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #214
----
commit b54ddf2830483705c4c4a865b9f2586ed457223a
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-05T21:56:10Z
Made significant memory improvements to TraverserExecutor. Realized some
massive heaps on some jobs on Friendster using SparkGraphComputer and tracked
it down to how I'm dealing with traversers in TraversalVertexProgram. I was not
'draining' sets of traversers and thus was using an excessive amount of
memory. This really shows itself when touching edges, where you can easily
generate millions of traversers, and having multiple copies of that data is bad.
To make draining work, I had to update all the Iterators to support .remove(),
which simply calls .remove() on the child iterator. Found a simple optimization
for CountGlobalStep that will make OLAP counting much faster.
----
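The `.remove()` delegation mentioned in the commit message can be sketched as follows. This is an assumption-laden illustration, not one of the iterators changed in the commit: `MappingIterator` is a hypothetical wrapper that transforms the elements of a child iterator and forwards `remove()` to it, so that removing through the wrapper removes from the underlying collection.

```java
import java.util.Iterator;
import java.util.function.Function;

// Hypothetical wrapper iterator; the name and shape are illustrative only.
final class MappingIterator<A, B> implements Iterator<B> {
    private final Iterator<A> child;
    private final Function<? super A, ? extends B> mapper;

    MappingIterator(Iterator<A> child, Function<? super A, ? extends B> mapper) {
        this.child = child;
        this.mapper = mapper;
    }

    @Override
    public boolean hasNext() {
        return child.hasNext();
    }

    @Override
    public B next() {
        return mapper.apply(child.next());
    }

    // Delegate removal to the child, so the element last returned by next()
    // is removed from whatever collection backs the child iterator.
    @Override
    public void remove() {
        child.remove();
    }
}
```

Without the override, `Iterator.remove()` falls back to its default implementation, which throws `UnsupportedOperationException`, so a drain loop over a wrapped iterator would fail.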