[ 
https://issues.apache.org/jira/browse/TINKERPOP-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

steigma updated TINKERPOP-2899:
-------------------------------
    Description: 
For some queries, the SampleGlobalStep can take a huge amount of time. For 
example, on a Gremlin variant of the LUBM dataset and the following query
{code:java}
g.V().hasLabel('Course').sample(1000).in('teacherOf').path() {code}
the sample step has 108000 traversers as input, but requires over 200 seconds. 
More precisely:
Finished collecting all traversers from previous step with processAllStarts 
(see 
[https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/filter/SampleGlobalStep.java#L83-L85])
 takes 108344 ms, thereby createProjectedTraverser: 1416 ms, traverserSet::add: 
106238 ms. The barrierConsumer finished sampling in 121088 ms, whereby the loop 
was executed 53356191 times.
 
There seem to be two issues:
 
1. There are many hash collisions when adding the projected traverser to the 
map in the TraverserSet (see 
[https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/traverser/util/TraverserSet.java#L91],
 called from the prcessAllStarts 
([https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/filter/SampleGlobalStep.java#L84]).
  For example, for 
v[[http://www.Department0.University15.edu/Course0|http://www.department0.university15.edu/Course0]],
 we compute the hash code via {{{}super.hashCode() ^ this.path.hashCode(){}}}, 
which is {{-2056934079 - -2056934048 = 33}} (basically the hash code of the id 
string XOR the hash code of the path, i.e., singleton list of the id string). 
This leads to many very similar hash codes, e.g.,
{code:java}
33
103
33
227
33
111
33
995
33
103
33
31
481
39
97
35
225
47
97
35
993
39
225
99
33
... {code}

 
2. The sampling algorithm seems to be extremely inefficient (often running the 
loop again without adding a new sample). It is not completely clear why it was 
written that way (lack of documentation), but it may have been necessary to 
correctly handling these “bulks”.
 
The second issue could potentially be addressed by checking whether all 
traversers have the same weight 
([https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/filter/SampleGlobalStep.java#L96]),
 which seems to be the case in this example, and then use a more efficient 
sampling approach that just samples a corresponding number of integers between 
0 and the number of traversers in the sample set and store these numbers in a 
set (with count). With that we can iterate through the traverserSet and use the 
corresponding traversers if we have it sampled.
 
For the first issue, we can probably just avoid putting them in the 
TraverserSet and use an ArrayList instead. Then we could also directly sample 
from this ArrayList, which may even make this bulk handling superfluous?
 
There is however also an "efficient" rewrite of this query using fold, local 
sample and unfold:
g.V().hasLabel('Course').fold().sample(Scope.local,1000).unfold().in('teacherOf').path()
 
The general question is, however, whether the {color:#1d1c1d}TraverserSet can 
lead to hash collisions in other places. Maybe it makes sense to reevaluate the 
usage of this TraverserSet{color} and, if useful/required, switch to something 
more efficient time-wise (which could then require a bit more memory).

  was:
For some queries, the SampleGlobalStep can take a huge amount of time. For 
example, on a Gremlin variant of the LUBM dataset and the following query
{code:java}
g.V().hasLabel('Course').sample(1000).in('teacherOf').path() {code}
the sample step has 108000 traversers as input, but requires over 200 seconds. 
More precisely:
Finished collecting all traversers from previous step with processAllStarts 
(see 
https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/filter/SampleGlobalStep.java#L83-L85)
 takes 108344 ms, thereby createProjectedTraverser: 1416 ms, traverserSet::add: 
106238 ms. The barrierConsumer finished sampling in 121088 ms, whereby the loop 
was executed 53356191 times.
 
There seem to be two issues:
 
1. There are many hash collisions when adding the projected traverser to the 
map in the TraverserSet (see 
https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/traverser/util/TraverserSet.java#L91,
 called from the prcessAllStarts 
(https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/filter/SampleGlobalStep.java#L84).
  For example, for 
v[[http://www.Department0.University15.edu/Course0|http://www.department0.university15.edu/Course0]],
 we compute the hash code via {{{}super.hashCode() ^ this.path.hashCode(){}}}, 
which is {{-2056934079 - -2056934048 = 33}} (basically the hash code of the id 
string XOR the hash code of the path, i.e., singleton list of the id string). 
This leads to many very similar hash codes, e.g.,
33
103
33
227
33
111
33
995
33
103
33
31
481
39
97
35
225
47
97
35
993
39
225
99
33
...
 
2. The sampling algorithm seems to be extremely inefficient (often running the 
loop again without adding a new sample). It is not completely clear why it was 
written that way (lack of documentation), but it may have been necessary to 
correctly handling these “bulks”.
 
The second issue could potentially be addressed by checking whether all 
traversers have the same weight 
(https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/filter/SampleGlobalStep.java#L96),
 which seems to be the case in this example, and then use a more efficient 
sampling approach that just samples a corresponding number of integers between 
0 and the number of traversers in the sample set and store these numbers in a 
set (with count). With that we can iterate through the traverserSet and use the 
corresponding traversers if we have it sampled.
 
For the first issue, we can probably just avoid putting them in the 
TraverserSet and use an ArrayList instead. Then we could also directly sample 
from this ArrayList, which may even make this bulk handling superfluous?
 
There is however also an "efficient" rewrite of this query using fold, local 
sample and unfold:
g.V().hasLabel('Course').fold().sample(Scope.local,1000).unfold().in('teacherOf').path()
 
The general question is, however, whether the {color:#1d1c1d}TraverserSet can 
lead to hash collisions in other places. Maybe it makes sense to reevaluate the 
usage of this {color:#1d1c1d}TraverserSet{color} and, if useful/required, 
switch to something more efficient time-wise (which could then require a bit 
more memory).{color}


> SampleGlobalStep samples inefficiently with TraverserSet running into hash 
> collisions
> -------------------------------------------------------------------------------------
>
>                 Key: TINKERPOP-2899
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-2899
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: process
>            Reporter: steigma
>            Priority: Major
>
> For some queries, the SampleGlobalStep can take a huge amount of time. For 
> example, on a Gremlin variant of the LUBM dataset and the following query
> {code:java}
> g.V().hasLabel('Course').sample(1000).in('teacherOf').path() {code}
> the sample step has 108000 traversers as input, but requires over 200 
> seconds. More precisely:
> Finished collecting all traversers from previous step with processAllStarts 
> (see 
> [https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/filter/SampleGlobalStep.java#L83-L85])
>  takes 108344 ms, thereby createProjectedTraverser: 1416 ms, 
> traverserSet::add: 106238 ms. The barrierConsumer finished sampling in 121088 
> ms, whereby the loop was executed 53356191 times.
>  
> There seem to be two issues:
>  
> 1. There are many hash collisions when adding the projected traverser to the 
> map in the TraverserSet (see 
> [https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/traverser/util/TraverserSet.java#L91],
>  called from the prcessAllStarts 
> ([https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/filter/SampleGlobalStep.java#L84]).
>   For example, for 
> v[[http://www.Department0.University15.edu/Course0|http://www.department0.university15.edu/Course0]],
>  we compute the hash code via {{{}super.hashCode() ^ 
> this.path.hashCode(){}}}, which is {{-2056934079 - -2056934048 = 33}} 
> (basically the hash code of the id string XOR the hash code of the path, 
> i.e., singleton list of the id string). This leads to many very similar hash 
> codes, e.g.,
> {code:java}
> 33
> 103
> 33
> 227
> 33
> 111
> 33
> 995
> 33
> 103
> 33
> 31
> 481
> 39
> 97
> 35
> 225
> 47
> 97
> 35
> 993
> 39
> 225
> 99
> 33
> ... {code}
>  
> 2. The sampling algorithm seems to be extremely inefficient (often running 
> the loop again without adding a new sample). It is not completely clear why 
> it was written that way (lack of documentation), but it may have been 
> necessary to correctly handling these “bulks”.
>  
> The second issue could potentially be addressed by checking whether all 
> traversers have the same weight 
> ([https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/filter/SampleGlobalStep.java#L96]),
>  which seems to be the case in this example, and then use a more efficient 
> sampling approach that just samples a corresponding number of integers 
> between 0 and the number of traversers in the sample set and store these 
> numbers in a set (with count). With that we can iterate through the 
> traverserSet and use the corresponding traversers if we have it sampled.
>  
> For the first issue, we can probably just avoid putting them in the 
> TraverserSet and use an ArrayList instead. Then we could also directly sample 
> from this ArrayList, which may even make this bulk handling superfluous?
>  
> There is however also an "efficient" rewrite of this query using fold, local 
> sample and unfold:
> g.V().hasLabel('Course').fold().sample(Scope.local,1000).unfold().in('teacherOf').path()
>  
> The general question is, however, whether the {color:#1d1c1d}TraverserSet can 
> lead to hash collisions in other places. Maybe it makes sense to reevaluate 
> the usage of this TraverserSet{color} and, if useful/required, switch to 
> something more efficient time-wise (which could then require a bit more 
> memory).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to