Hello,

A fellow named Russ Spitzer kept mentioned "a pointless reduceByKey()" when 
doing SparkGraphComputer jobs. I decided to look into his (probably erroneous) 
claims while working on https://issues.apache.org/jira/browse/TINKERPOP-1120. 
Lo and behold, he was right (impossible!). There is an "extra" reduceByKey() 
that is done that, while it still needs to be done, it shouldn't shuffle so 
much data. There are 5 things in particular I accomplished yesterday:

        -------SparkGraphComputer
        1. If the vertex doesn't pass any messages, don't serialize an empty 
list, serialize null.
        2. If the vertex doesn't have a view, don't serialize an empty list of 
detached vertex properties, serialize null.
        3. If the vertex doesn't have a view nor messages, don't do anything!
        -------TraversalVertexProgram
        4. Found a memory bug where halted traversers were still distributed 
amongst the vertices even though they were sent to the master traversal.
        5. If a halted traverser TraverserSet is empty, remove the property 
(remove the vertex view!).

With that, TraversalVertexProgram (Gremlin OLAP) will typically have nothing to 
reduceByKey() on the final stage as there are no more messages being sent 
(messages) and thus, no more HALTED_TRAVERSERS distributed across the vertex 
set (views). Because of this, the final reduceByKey() that once took 1.2 
minutes now only takes 0.7 seconds. Combine that with the other memory/data 
reductions listed above and we get the following times below.

g.V().count() -- answer 125000000 (125 million vertices)
        - TinkerPop 3.0.0.MX: 2.5 hours
        - TinkerPop 3.0.0:      1.5 hours
        - TinkerPop 3.1.1:      23 minutes
        - TinkerPop 3.2.0:      6.8 minutes (Spark 1.5.2)
        - TinkerPop 3.2.0:      5.5 minutes (Spark 1.6.1)
        - TinkerPop 3.2.1:      4.5 minutes (Spark 1.6.1)

g.V().out().count() -- answer 2586147869 (2.5 billion length-1 paths (i.e. 
edges))
        - TinkerPop 3.0.0.MX: unknown
        - TinkerPop 3.0.0:      2.5 hours
        - TinkerPop 3.1.1:      1.1 hours
        - TinkerPop 3.2.0:      13 minutes (Spark 1.5.2)
        - TinkerPop 3.2.0:      12 minutes (Spark 1.6.1)
        - TinkerPop 3.2.1:      10 minutes (Spark 1.6.1)
        
g.V().out().out().count() -- answer 640528666156 (640 billion length-2 paths)
        - TinkerPop 3.0.0.MX: unknown
        - TinkerPop 3.0.0:      unknown
        - TinkerPop 3.1.1:      unknown
        - TinkerPop 3.2.0:      55 minutes (Spark 1.5.2)
        - TinkerPop 3.2.0:      50 minutes (Spark 1.6.1)
        - TinkerPop 3.2.1:      45 minutes (Spark 1.6.1)

g.V().out().out().out().count() -- answer 215664338057221 (215 trillion length 
3-paths)
        - TinkerPop 3.0.0.MX: 12.8 hours
        - TinkerPop 3.0.0:      8.6 hours
        - TinkerPop 3.1.1:      2.4 hours
        - TinkerPop 3.2.0:      1.6 hours (Spark 1.5.2)
        - TinkerPop 3.2.0:      1.5 hours (Spark 1.6.1)
        - TinkerPop 3.2.1:      1.3 hours (Spark 1.6.1)

g.V().out().out().out().out().count() -- answer 83841426570464575 (83 
quadrillion length 4-paths)
        - TinkerPop 3.0.0.MX: unknown
        - TinkerPop 3.0.0:      unknown
        - TinkerPop 3.1.1:      unknown
        - TinkerPop 3.2.0:      unknown (Spark 1.5.2)
        - TinkerPop 3.2.0:      2.1 hours (Spark 1.6.1)
        - TinkerPop 3.2.1:      1.9 hours (Spark 1.6.1)

g.V().out().out().out().out().out().count() -- answer -2280190503167902456 !! I 
blew the long space -- 64-bit overflow.
        - TinkerPop 3.0.0.MX: unknown
        - TinkerPop 3.0.0:      unknown
        - TinkerPop 3.1.1:      unknown
        - TinkerPop 3.2.0:      unknown (Spark 1.5.2)
        - TinkerPop 3.2.0:      2.8 hours (Spark 1.6.1)
        - TinkerPop 3.2.1:      2.4 hours (Spark 1.6.1)

g.V().group().by(outE().count()).by(count()). 
        - TinkerPop 3.2.0:      12 minutes (Spark 1.6.1)
        - TinkerPop 3.2.1:      12 minutes (Spark 1.6.1)

NOTE: This work is currently not in master/ as it still needs to go through 
Apache TinkerPop VOTE. 

Please tip your waitress,
Marko.

http://markorodriguez.com

Reply via email to