GitHub user okram opened a pull request:
https://github.com/apache/incubator-tinkerpop/pull/301
TINKERPOP-1120: If there is no view nor messages, don't create empty
views/messages in SparkExecutor
https://issues.apache.org/jira/browse/TINKERPOP-1120
This PR affects TraversalVertexProgram and SparkGraphComputer.
Here is what changed in each:
-------SparkGraphComputer
1. If the vertex doesn't pass any messages, don't serialize an empty
list, serialize null.
2. If the vertex doesn't have a view, don't serialize an empty list of
detached vertex properties, serialize null.
3. If the vertex has neither a view nor messages, don't do anything!
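A minimal sketch of points 1-3, assuming a simplified payload type. `SketchPayload` and its methods are hypothetical stand-ins for illustration only, not the actual classes used by SparkExecutor, and plain strings stand in for detached vertex properties and messages:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a per-vertex payload (not a TinkerPop class).
final class SketchPayload {
    private final List<String> view;     // null when the vertex has no view
    private final List<String> messages; // null when the vertex sends no messages

    private SketchPayload(final List<String> view, final List<String> messages) {
        this.view = view;
        this.messages = messages;
    }

    // Stores null in place of each empty list (points 1 and 2) and returns
    // null outright when there is neither a view nor messages (point 3).
    static SketchPayload of(final List<String> view, final List<String> messages) {
        final List<String> v = (view == null || view.isEmpty()) ? null : view;
        final List<String> m = (messages == null || messages.isEmpty()) ? null : messages;
        if (v == null && m == null) {
            return null;
        }
        return new SketchPayload(v, m);
    }

    boolean hasView() {
        return view != null;
    }

    boolean hasMessages() {
        return messages != null;
    }

    // Readers map null back to an empty list, so callers never handle the null.
    List<String> view() {
        return view == null ? Collections.<String>emptyList() : view;
    }

    List<String> messages() {
        return messages == null ? Collections.<String>emptyList() : messages;
    }
}
```

The idea is that a null reference is cheaper to serialize and shuffle than an empty collection object, while the accessors keep the reading side unchanged.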
-------TraversalVertexProgram
4. Found a memory bug where halted traversers were still distributed
amongst the vertices even though they were sent to the master traversal.
5. If a halted traverser TraverserSet is empty, remove the property
(remove the vertex view!).
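Point 5 can be sketched as follows, assuming a plain map stands in for a vertex's compute properties. `HaltedSketch`, `storeHalted`, and the key string are hypothetical names chosen for illustration, not the TraversalVertexProgram implementation:

```java
import java.util.Map;
import java.util.Set;

// Hypothetical illustration: a Map plays the role of the vertex's compute properties.
final class HaltedSketch {
    // Stand-in key; the real constant is TraversalVertexProgram.HALTED_TRAVERSERS.
    static final String HALTED_TRAVERSERS = "haltedTraversers";

    // Keeps the property only while there are halted traversers to store;
    // an empty set removes the property instead of persisting an empty value.
    static void storeHalted(final Map<String, Object> vertexProperties, final Set<String> halted) {
        if (halted.isEmpty()) {
            vertexProperties.remove(HALTED_TRAVERSERS);
        } else {
            vertexProperties.put(HALTED_TRAVERSERS, halted);
        }
    }
}
```

With this shape, a vertex that holds no halted traversers carries no property at all, so there is nothing to include in its view.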
You can read about the performance gains from these changes here:
https://groups.google.com/d/msg/gremlin-users/NKjEXdRNp-M/S48pDXjdAQAJ
CHANGELOG
```
* `SparkGraphComputer` no longer shuffles empty views or empty outgoing
messages in order to save time and space.
* `TraversalVertexProgram` no longer maintains empty halted traverser
properties in order to save space.
```
UPGRADE
```
TraversalVertexProgram
----------------------
`TraversalVertexProgram` previously always maintained a `HALTED_TRAVERSERS`
`TraverserSet` for each vertex throughout the life of the OLAP computation.
However, if there are no halted traversers in the set, there is no point in
keeping the compute property around; dropping it saves time and space.
Users whose `VertexPrograms` are chained off of `TraversalVertexProgram` and
assume that `HALTED_TRAVERSERS` always exists should no longer make that
assumption.
---java
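// bad code: value() throws if the HALTED_TRAVERSERS property does not exist
TraverserSet haltedTraversers =
    vertex.value(TraversalVertexProgram.HALTED_TRAVERSERS);
// good code: fall back to an empty TraverserSet when the property is absent
TraverserSet haltedTraversers =
    vertex.property(TraversalVertexProgram.HALTED_TRAVERSERS)
          .orElse(new TraverserSet());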
---
```
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP-1120
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-tinkerpop/pull/301.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #301
----
commit 4a7888681152cb005d632416371b7f66da6f119c
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-05-02T21:06:53Z
Empty lists are not created if no messages or views are created. Instead
the payload is null. This helps to reduce the memory footprint, both in RAM and
during shuffle/disk/network.
commit cd5524d73928c0e9b8a2260fad1b1e29c3f53ef5
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-05-02T22:16:15Z
a bunch of knick-knack optimizations generally in TraversalVertexProgram and
specifically in SparkGraphComputer. If there are no HALTED_TRAVERSERS, then do
not propagate an empty set -- property.remove(). In Spark, if there are no
outgoing messages or new view, do not propagate empty ViewPayloads -- using
null. Found a memory bug in TraversalVertexProgram where if the
HALTED_TRAVERSERS are supposed to go back to the master traversal, they were
still being persisted across the vertices. These tweaks should definitely
reduce stress on large graphs as the memory footprint is greatly reduced.
Unfortunately, we still need reduceByKey() even on empty views/messages as it's
not known that they are empty until after the action.
commit e3a4b7ff9bd730b7056b4ab224ea8e9255263c9b
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-05-02T22:25:30Z
another null memory tweak. no point sending around empty lists --- using
null instead.
commit 6f13c0cfc20d8c0cbf1681359792e543bd3676bc
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-05-02T22:36:26Z
more minor memory tweaks. running integration tests overnight.
commit 79ebaf9f94f0b645ba493551ff219a786003cc85
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-05-03T01:03:11Z
finally figured out how to do a reduceByKey() with empty tuples. This is
the super optimization -- if there are no views and no outgoing messages, then
the reduceByKey is trivial. For TraversalVertexProgram, this means
that the final step takes no time at all. Running integration tests overnight.
commit 8fd9502160b7940a806247a16406663ff4b27826
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-05-03T13:49:14Z
some last-minute cleanups and comments before PR. integration tests passed
overnight. Spark integration tests passed for these changes right now.
----