GitHub user okram opened a pull request: https://github.com/apache/incubator-tinkerpop/pull/301
    TINKERPOP-1120: If there is no view nor messages, don't create empty views/messages in SparkExecutor

    https://issues.apache.org/jira/browse/TINKERPOP-1120

    This PR affects `TraversalVertexProgram` and `SparkGraphComputer`. Here is what changed in both:

    -------SparkGraphComputer

    1. If the vertex doesn't pass any messages, don't serialize an empty list, serialize null.
    2. If the vertex doesn't have a view, don't serialize an empty list of detached vertex properties, serialize null.
    3. If the vertex has neither a view nor messages, don't do anything!

    -------TraversalVertexProgram

    4. Found a memory bug where halted traversers were still distributed amongst the vertices even though they were sent to the master traversal.
    5. If a halted traverser `TraverserSet` is empty, remove the property (remove the vertex view!).

    You can read about the performance gains from these changes here: https://groups.google.com/d/msg/gremlin-users/NKjEXdRNp-M/S48pDXjdAQAJ

    CHANGELOG

    ```
    * `SparkGraphComputer` no longer shuffles empty views or empty outgoing messages in order to save time and space.
    * `TraversalVertexProgram` no longer maintains empty halted traverser properties in order to save space.
    ```

    UPGRADE

    ```
    TraversalVertexProgram
    ----------------------

    `TraversalVertexProgram` previously maintained a `HALTED_TRAVERSERS` `TraverserSet` for each vertex throughout the life of the OLAP computation. However, if there are no halted traversers in the set, there is no point in keeping the compute property around: removing it saves both time and space. Users whose `VertexPrograms` are chained off of `TraversalVertexProgram` and previously assumed that `HALTED_TRAVERSERS` always exists should no longer assume that.

    ---java
    // bad code
    TraverserSet haltedTraversers = vertex.value(TraversalVertexProgram.HALTED_TRAVERSERS);
    // good code
    TraverserSet haltedTraversers = vertex.property(TraversalVertexProgram.HALTED_TRAVERSERS).orElse(new TraverserSet());
    ---
    ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP-1120

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-tinkerpop/pull/301.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #301

----
commit 4a7888681152cb005d632416371b7f66da6f119c
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-05-02T21:06:53Z

    Empty lists are not created if no messages or views are created. Instead the payload is null. This helps to reduce the memory footprint, both in RAM and during shuffle/disk/network.

commit cd5524d73928c0e9b8a2260fad1b1e29c3f53ef5
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-05-02T22:16:15Z

    a bunch of knick-knack optimizations generally in TraversalVertexProgram and specifically in SparkGraphComputer. If there are no HALTED_TRAVERSERS, then do not propagate an empty set -- property.remove(). In Spark, if there are no outgoing messages or new view, do not propagate empty ViewPayloads -- using null. Found a memory bug in TraversalVertexProgram where if the HALTED_TRAVERSERS are supposed to go back to the master traversal, they were still being persisted across the vertices. These tweaks should definitely reduce stress on large graphs as the memory footprint is greatly reduced. Unfortunately, we still need reduceByKey() even on empty views/messages as it is not known that they are empty until after the action.

commit e3a4b7ff9bd730b7056b4ab224ea8e9255263c9b
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-05-02T22:25:30Z

    another null memory tweak. no point sending around empty lists --- using null instead.

commit 6f13c0cfc20d8c0cbf1681359792e543bd3676bc
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-05-02T22:36:26Z

    more minor memory tweaks. running integration tests overnight.

commit 79ebaf9f94f0b645ba493551ff219a786003cc85
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-05-03T01:03:11Z

    finally figured out how to do a reduceByKey() with empty tuples. This is the super optimization -- if there are no views and no outgoing messages, then the reduceByKey is trivially complex. For TraversalVertexProgram, this means that the final step takes no time at all. Running integration tests overnight.

commit 8fd9502160b7940a806247a16406663ff4b27826
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-05-03T13:49:14Z

    some last minute cleanups, comments before PR. integration tests passed overnight. Spark integration tests passed for these changes right now.

----
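
For readers who want to see the "null instead of empty list" idea from the commits above in isolation, here is a minimal sketch in plain Java. It is not the code in this pull request: `NullPayloadSketch`, `Payload`, `of`, `merge`, `concat`, and `emptyToNull` are hypothetical names made up for illustration, strings stand in for detached vertex properties and messages, and no Spark or TinkerPop classes are used. It only shows how a merge function of the kind handed to a reduce-by-key step might treat a null payload as "nothing to merge".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a view/message payload; not the actual SparkExecutor classes.
final class NullPayloadSketch {

    /** Holds a vertex's view and outgoing messages; a null field means "nothing". */
    static final class Payload {
        final List<String> view;      // null when the vertex has no view
        final List<String> messages;  // null when the vertex sent no messages

        Payload(List<String> view, List<String> messages) {
            this.view = view;
            this.messages = messages;
        }
    }

    /** Maps an absent or empty list to null so that nothing empty is carried around. */
    static List<String> emptyToNull(List<String> list) {
        return (list == null || list.isEmpty()) ? null : list;
    }

    /** Builds a payload, or returns null when there is neither a view nor messages. */
    static Payload of(List<String> view, List<String> messages) {
        List<String> v = emptyToNull(view);
        List<String> m = emptyToNull(messages);
        return (v == null && m == null) ? null : new Payload(v, m);
    }

    /** Concatenates two nullable lists; allocates only when both sides are present. */
    static List<String> concat(List<String> a, List<String> b) {
        if (a == null) return b;
        if (b == null) return a;
        List<String> both = new ArrayList<>(a);
        both.addAll(b);
        return both;
    }

    /** Merge function for a reduce step; a null payload passes the other side through. */
    static Payload merge(Payload a, Payload b) {
        if (a == null) return b;
        if (b == null) return a;
        return new Payload(concat(a.view, b.view), concat(a.messages, b.messages));
    }
}
```

With this shape, a vertex that has neither a view nor messages contributes only a null reference, and merging it with anything else returns the other operand unchanged, so no empty collections are allocated or serialized along the way.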