[jira] [Commented] (TINKERPOP-1131) TraversalVertexProgram traverser management is inefficient memory-wise.

ASF GitHub Bot (JIRA) Fri, 05 Feb 2016 14:05:53 -0800

    [ 
https://issues.apache.org/jira/browse/TINKERPOP-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135119#comment-15135119
 ]


ASF GitHub Bot commented on TINKERPOP-1131:
-------------------------------------------

GitHub user okram opened a pull request:

    https://github.com/apache/incubator-tinkerpop/pull/214

    TINKERPOP-1131: TraversalVertexProgram traverser management is inefficient 
memory-wise.

    https://issues.apache.org/jira/browse/TINKERPOP-1131
    
    This will go into TinkerPop 3.1.2. I will then upmerge it to master/ 
(3.2.0) and when https://github.com/apache/incubator-tinkerpop/pull/210 is 
merged to master/ (3.2.0), we will have a lean-mean OLAP processing machine!
    
    I realized that I was copying `TraverserSets` instead of draining one into 
the other. With draining, every traverser put into one set is removed from the 
other. Previously, in the worst case, we could have up to 3 sets of equivalent 
size all in memory at one vertex in OLAP. If you do any sort of `outE()`-type 
traversal, that's A LOT of data. Moreover, I reorganized the flow of processing, 
as previously I was determining whether messages needed to be sent right after 
receiving messages! Pointless waste of CPU cycles. Finally, I created a "bulking 
model" whereby I try to fill a `Step` with as many traversers as I can before 
I have to drain it for message passing. Previously, it was one traverser at a 
time -- again, this can lead to significant inefficiencies.
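
As a rough illustration of the "bulking model" idea (a sketch only, not the code in this pull request): `BulkingSketch`, `DoubleStep`, `addStart`, and `flushInto` are hypothetical names, and plain integers stand in for traversers. The point is only that buffering many inputs before a single drain needs far fewer drain passes than handling one input at a time.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

final class BulkingSketch {
    // Hypothetical stand-in for a step: buffers inputs and emits outputs when flushed.
    static final class DoubleStep {
        private final Queue<Integer> starts = new ArrayDeque<>();
        int flushes = 0;

        void addStart(int traverser) {
            starts.add(traverser);
        }

        // Drain everything buffered so far into 'out'; each call counts as one pass.
        void flushInto(List<Integer> out) {
            flushes++;
            Integer t;
            while ((t = starts.poll()) != null) {
                out.add(t * 2);
            }
        }
    }

    // One traverser at a time: one drain pass per input.
    static int oneAtATime(List<Integer> traversers, List<Integer> out) {
        DoubleStep step = new DoubleStep();
        for (int t : traversers) {
            step.addStart(t);
            step.flushInto(out);
        }
        return step.flushes;
    }

    // Bulked: fill the step first, then drain once.
    static int bulked(List<Integer> traversers, List<Integer> out) {
        DoubleStep step = new DoubleStep();
        for (int t : traversers) {
            step.addStart(t);
        }
        step.flushInto(out);
        return step.flushes;
    }
}
```

For three inputs, `oneAtATime` performs three drain passes and `bulked` performs one, while both produce the same outputs.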
    
    Ran `mvn clean install` and the integration tests (Giraph is still running, 
but it should work: if one `TraversalVertexProgram` test passes, the meat is right). 
    
    VOTE +1.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP-1131

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-tinkerpop/pull/214.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #214
    
----
commit b54ddf2830483705c4c4a865b9f2586ed457223a
Author: Marko A. Rodriguez <[email protected]>
Date:   2016-02-05T21:56:10Z

    Made significant memory improvements to TraverserExecutor. Realized some 
massive heaps on some jobs on Friendster using SparkGraphComputer and tracked 
it down to how I'm dealing with traversers in TraversalVertexProgram. I was not 
'draining' sets of traversers and thus was using an excessive amount of 
memory. This really shows itself when touching edges, where you can easily 
generate millions of traversers, and having multiple copies of that data is bad. 
To make draining work, I had to update all the Iterators to support .remove(), 
which simply calls .remove() on the child iterator. Found a simple optimization 
for CountGlobalStep that will make OLAP counting much faster.
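
A minimal sketch of what forwarding `.remove()` to a child iterator can look like, using only `java.util`. `RemoveForwarding`, `MappingIterator`, and `drainLabels` are hypothetical names for illustration, not classes from the commit. Without the override, the default `Iterator.remove()` throws `UnsupportedOperationException`, so a wrapped iterator could not be drained.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

final class RemoveForwarding {
    // Hypothetical wrapper: maps each element of a child iterator.
    static final class MappingIterator<A, B> implements Iterator<B> {
        private final Iterator<A> child;
        private final Function<A, B> mapper;

        MappingIterator(Iterator<A> child, Function<A, B> mapper) {
            this.child = child;
            this.mapper = mapper;
        }

        @Override
        public boolean hasNext() {
            return child.hasNext();
        }

        @Override
        public B next() {
            return mapper.apply(child.next());
        }

        // Forward removal so the element just returned is dropped from the
        // underlying collection.
        @Override
        public void remove() {
            child.remove();
        }
    }

    // Drains 'ids' through the wrapper: 'ids' is emptied as the labels are collected.
    static List<String> drainLabels(List<Integer> ids) {
        List<String> out = new ArrayList<>();
        Iterator<String> it =
            new MappingIterator<Integer, String>(ids.iterator(), id -> "v[" + id + "]");
        while (it.hasNext()) {
            out.add(it.next());
            it.remove();
        }
        return out;
    }
}
```

The list passed to `drainLabels` must be mutable (for example an `ArrayList`), since removal reaches the backing collection through its own iterator.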

----


> TraversalVertexProgram traverser management is inefficient memory-wise.
> -----------------------------------------------------------------------
>
>                 Key: TINKERPOP-1131
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-1131
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: process
>    Affects Versions: 3.1.1-incubating
>            Reporter: Marko A. Rodriguez
>            Assignee: Marko A. Rodriguez
>             Fix For: 3.1.2-incubating
>
>
> The traversers incoming to a vertex at an iteration are in a 
> {{TraverserSet}}. We iterate that set and attach the traversers to their 
> respective local object (e.g. vertex, edge, property). This creates a 
> {{toProcess}} {{TraverserSet}}. At this point, we have 2 sets of the same size! 
> We NEVER clear the message set, and we process the {{toProcess}} traversers to 
> create an {{aliveTraversers}} set. Now, 3 sets! If you have millions of edges 
> on an {{outE()}}, you have 3 sets of millions of entries each (nasty!). We then 
> set {{toProcess}} to {{aliveTraversers}} and keep doing this until the set is 
> completely empty. (It empties when the traversers need to go to another vertex 
> to keep processing -- a message pass.)
> So, to preserve memory we need to "drain" the {{TraverserSets}}. That is, 
> iterate and {{remove()}} so that we don't create set clones, blow the heap, and 
> cause (e.g.) {{SparkGraphComputer}} to spill memory to disk. 
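
The "iterate and remove()" idea from the description above can be sketched with plain `java.util` sets standing in for `TraverserSet`. `DrainSketch` and its methods are hypothetical, and the variable names only mirror the sets named in the description. This is an illustration under those assumptions, not the project's implementation.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

final class DrainSketch {
    // Copying: once this returns, source and copy each hold every element.
    static <T> Set<T> copy(Set<T> source) {
        return new HashSet<>(source);
    }

    // Draining: each element is removed from the source right after it is added
    // to the target, so at most one element is held by both sets at any moment.
    static <T> void drain(Set<T> source, Set<T> target) {
        Iterator<T> it = source.iterator();
        while (it.hasNext()) {
            target.add(it.next());
            it.remove();
        }
    }

    public static void main(String[] args) {
        Set<Integer> incoming = new HashSet<>();
        for (int i = 0; i < 1000; i++) {
            incoming.add(i);
        }
        Set<Integer> toProcess = new HashSet<>();
        drain(incoming, toProcess);
        Set<Integer> aliveTraversers = new HashSet<>();
        drain(toProcess, aliveTraversers);
        // Prints "0 0 1000": only the last set still holds the elements.
        System.out.println(
            incoming.size() + " " + toProcess.size() + " " + aliveTraversers.size());
    }
}
```

With `copy` in place of `drain`, all three sets would hold 1000 elements at the end, which is the three-set situation the description warns about.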



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
