GitHub user okram opened a pull request: https://github.com/apache/tinkerpop/pull/524
TINKERPOP-1585 & TINKERPOP-1590: DedupGlobalStep and Distributed TinkerMemory https://issues.apache.org/jira/browse/TINKERPOP-1585 https://issues.apache.org/jira/browse/TINKERPOP-1590 This was a week long project with @dkuppitz. In short, we made `DedupGlobalStep` a bit better. It had a bug where it was deduping (in OLAP) twice at the master traversals. That is no longer and issue. Moreover, I added an explicit barrier data structure to reduce the amount of data movement between the master barrier and the master step. When working on this, I originally thought the bug was in `TinkerMemory` (`TinkerGraphComputer`) where there is lock contention for barrier data. There is! and I was able to seep up a toy traversal we had from 7 seconds to 2.9 seconds cause of it by having each `TinkerWorker` maintain its own memory data structure that is then merged into the master `TinkerMemory` when the iteration is complete. Standard, we just never thought to do this. Lots of other nick nack goodies... ``` * Fixed an optimization bug in OLAP-based `DedupGlobalStep` where deduping occurred twice. * `MemoryComputeKey` now implements `Cloneable` which is useful for `BiOperator` reducers that maintain thread-unsafe state. * `TinkerGraphComputer` now supports distributed `Memory` with lock-free partition aggregation. ``` VOTE +1. You can merge this pull request into a Git repository by running: $ git pull https://github.com/apache/tinkerpop TINKERPOP-1585 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tinkerpop/pull/524.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #524 ---- commit 2ddc632494c96163f2a95e32ce73f03f38101262 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2017-01-03T18:31:33Z realized that DedupGlobalStep was doing a 'double dedup'. the barrier which is already dedup'd is added to the master traveral's dedup-step and then dedup'd again. I simple boolean flag added to determine if the master traversal's dedup is getting data from workers or not and thus, if so, don't dedup again. commit f352b018f2c7d389abbf759e17fa1b094c14c63c Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2017-01-03T18:51:38Z minor nothing. commit 0db0991cc092e33091068f59f81c27feb9712379 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2017-01-04T00:59:39Z Added TinkerWorkerMemory which will aggregate local to the current thread before propagated Memory to TinkerMemory. This reduces synchronization issues due all threads contending to mutate the master memory. commit 056d7aedffa83f8d06462617670273857e2bea19 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2017-01-04T01:00:18Z forgot to add TinkerWorkerMemory file. commit 1ac003d00807c1351594d04a4bbdb55e93b00134 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2017-01-04T11:06:30Z Added Serializer.cloneObject() which clones via serialization (helper method). TinkerGraphComputer now how a sound distributed Memory system where each worker/thread aggregates without concurrency locally and then, at the end of the iteration, the thread-distributed memories are aggregated into the main TinkerMemory. commit 8deca70680e8c89b31fea1cc99300740f45eec56 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2017-01-04T11:55:12Z Made is so MemoryComputeKey implements Cloneable. This is actually really important we have NOT been cloning the BiOperators of OrderGlobalStep and GroupStep. We have just been 'getting lucky' in that Spark and Giraph use Serialization and thus we get a clone for free. However, for parallelization within a JVM, we woulld have issues except we never realized because we had a single global Memory for TinkerGraph. Now we don't and clone()ing bi operators works. commit cef19796ffd2ad27074174ad7f90b9e1f24bc07e Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2017-01-04T12:05:09Z updated CHANGELOG. Going to move on. Need example Gryo dataset from @dkuppitz to verify performance improvement of DedupGlobalStep on SparkGraphComputer. commit 8d961285f687107ffc5c1d8a3bca7c78ea833adf Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2017-01-04T17:32:21Z discovering various synchronization bottlenecks in TinkerGraphComputer. Also, realized some dumb things I was doing in TraversalVertexProgram. Its crazy, for this benchmark that @dkuppitz and i have, if I don't touch vertex.properties that are compute keys: millisecond return times. If I do, seconds return times.... Need to figure out how to partition TinkerGraphView... Perhaps thread local..dah. commit 27a4b271c54f0ab06fc8c0ad0578e4433d6db536 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2017-01-04T17:49:02Z CHAGELOG updated. Lunch time. commit 3dd1f6e5e141d715e678c66acd0516806e625047 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2017-01-04T19:39:35Z no more SynchronizedIterator for TinkerGraphComputer. Each worker/thread has their own Iterator<Vertex> partition. In TinkerPop 3.3.0 (with Partitioner) this will be replaced (easily). commit 253248e52b334d3990f6364d45174abca00476cd Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2017-01-04T19:50:17Z now that MemoryComputeKeys are cloneable, they are cloned accordignly in TraversalVertexProgram.clone(). commit 64c80657bf9199d5b9f28f56b65f1ce327192bd1 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2017-01-04T20:13:58Z added the concept of a master barrier to DedupGlobalStep to speed up the addition of worker barriers into the master traversal. A slight performance increase. commit ba390748443c603009b5a59b6dc64882a4819770 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2017-01-05T17:34:23Z more minor optimizations to DedupGlobalStep. commit e9258086441d013e9961d81473b275054d41d8cc Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2017-01-05T20:29:32Z have DedupGlobalStep as optimized as I can get it. I think BloomFilters are the next thing. Also, detachment factory stuff will help reduce barrier sizes. Found a bug in SparkInterceptorStrategyTest. Fixed. commit dc38b435341d587c4d3c89bbc4b5261148077b0a Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2017-01-05T23:59:14Z removed the toy file and updated CHANGELOG. ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---