[
https://issues.apache.org/jira/browse/TINKERPOP-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yang Xia updated TINKERPOP-2899:
--------------------------------
Affects Version/s: 3.5.5
> SampleGlobalStep samples inefficiently with TraverserSet running into hash
> collisions
> -------------------------------------------------------------------------------------
>
> Key: TINKERPOP-2899
> URL: https://issues.apache.org/jira/browse/TINKERPOP-2899
> Project: TinkerPop
> Issue Type: Improvement
> Components: process
> Affects Versions: 3.5.5
> Reporter: steigma
> Priority: Major
>
> For some queries, the SampleGlobalStep can take a huge amount of time. For
> example, on a Gremlin variant of the LUBM dataset and the following query
> {code:java}
> g.V().hasLabel('Course').sample(1000).in('teacherOf').path() {code}
> the sample step has 108000 traversers as input, but requires over 200
> seconds. More precisely: Collecting all traversers from previous step with
> processAllStarts (see
> [https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/filter/SampleGlobalStep.java#L83-L85])
> takes 108344 ms, thereby createProjectedTraverser: 1416 ms,
> traverserSet::add: 106238 ms. The barrierConsumer finished sampling in 121088
> ms, whereby the loop was executed 53356191 times.
>
> There seem to be two issues:
>
> 1. There are many hash collisions when adding the projected traverser to the
> map in the TraverserSet (see
> [https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/traverser/util/TraverserSet.java#L91],
> called from the prcessAllStarts
> ([https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/filter/SampleGlobalStep.java#L84]).
> For example, for
> v[[http://www.Department0.University15.edu/Course0|http://www.department0.university15.edu/Course0]],
> we compute the hash code via {{{}super.hashCode() ^
> this.path.hashCode(){}}}, which is {{-2056934079 - -2056934048 = 33}}
> (basically the hash code of the id string XOR the hash code of the path,
> i.e., singleton list of the id string). This leads to many very similar hash
> codes, e.g.,
> {code:java}
> 33
> 103
> 33
> 227
> 33
> 111
> 33
> 995
> 33
> 103
> 33
> 31
> 481
> 39
> 97
> 35
> 225
> 47
> 97
> 35
> 993
> 39
> 225
> 99
> 33
> ... {code}
>
> 2. The sampling algorithm seems to be extremely inefficient (often running
> the loop again without adding a new sample). It is not completely clear why
> it was written that way (lack of documentation), but it may have been
> necessary to correctly handling these “bulks”.
>
> The second issue could potentially be addressed by checking whether all
> traversers have the same weight
> ([https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/filter/SampleGlobalStep.java#L96]),
> which seems to be the case in this example, and then use a more efficient
> sampling approach that just samples a corresponding number of integers
> between 0 and the number of traversers in the sample set and store these
> numbers in a set (with count). With that we can iterate through the
> traverserSet and use the corresponding traversers if we have it sampled.
>
> For the first issue, we can probably just avoid putting them in the
> TraverserSet and use an ArrayList instead. Then we could also directly sample
> from this ArrayList, which may even make this bulk handling superfluous?
>
> There is however also an "efficient" rewrite of this query using fold, local
> sample and unfold:
> {code:java}
> g.V().hasLabel('Course').fold().sample(Scope.local,1000).unfold().in('teacherOf').path()
> {code}
>
> The general question is, however, whether the {color:#1d1c1d}TraverserSet can
> lead to hash collisions in other places. Maybe it makes sense to reevaluate
> the usage of this TraverserSet{color} and, if useful/required, switch to
> something more efficient time-wise (which could then require a bit more
> memory).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)