[jira] [Updated] (TINKERPOP-2899) SampleGlobalStep samples inefficiently with TraverserSet running into hash collisions

Yang Xia (Jira) Fri, 24 Mar 2023 12:19:20 -0700


     [ 
https://issues.apache.org/jira/browse/TINKERPOP-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yang Xia updated TINKERPOP-2899:
--------------------------------
    Affects Version/s: 3.5.5

> SampleGlobalStep samples inefficiently with TraverserSet running into hash 
> collisions
> -------------------------------------------------------------------------------------
>
>                 Key: TINKERPOP-2899
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-2899
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: process
>    Affects Versions: 3.5.5
>            Reporter: steigma
>            Priority: Major
>
> For some queries, the SampleGlobalStep can take a huge amount of time. For 
> example, on a Gremlin variant of the LUBM dataset and the following query
> {code:java}
> g.V().hasLabel('Course').sample(1000).in('teacherOf').path() {code}
> the sample step has 108000 traversers as input, but requires over 200 
> seconds. More precisely: Collecting all traversers from previous step with 
> processAllStarts (see 
> [https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/filter/SampleGlobalStep.java#L83-L85])
>  takes 108344 ms, thereby createProjectedTraverser: 1416 ms, 
> traverserSet::add: 106238 ms. The barrierConsumer finished sampling in 121088 
> ms, whereby the loop was executed 53356191 times.
>  
> There seem to be two issues:
>  
> 1. There are many hash collisions when adding the projected traverser to the 
> map in the TraverserSet (see 
> [https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/traverser/util/TraverserSet.java#L91],
>  called from the prcessAllStarts 
> ([https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/filter/SampleGlobalStep.java#L84]).
>   For example, for 
> v[[http://www.Department0.University15.edu/Course0|http://www.department0.university15.edu/Course0]],
>  we compute the hash code via {{{}super.hashCode() ^ 
> this.path.hashCode(){}}}, which is {{-2056934079 - -2056934048 = 33}} 
> (basically the hash code of the id string XOR the hash code of the path, 
> i.e., singleton list of the id string). This leads to many very similar hash 
> codes, e.g.,
> {code:java}
> 33
> 103
> 33
> 227
> 33
> 111
> 33
> 995
> 33
> 103
> 33
> 31
> 481
> 39
> 97
> 35
> 225
> 47
> 97
> 35
> 993
> 39
> 225
> 99
> 33
> ... {code}
>  
> 2. The sampling algorithm seems to be extremely inefficient (often running 
> the loop again without adding a new sample). It is not completely clear why 
> it was written that way (lack of documentation), but it may have been 
> necessary to correctly handling these “bulks”.
>  
> The second issue could potentially be addressed by checking whether all 
> traversers have the same weight 
> ([https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/filter/SampleGlobalStep.java#L96]),
>  which seems to be the case in this example, and then use a more efficient 
> sampling approach that just samples a corresponding number of integers 
> between 0 and the number of traversers in the sample set and store these 
> numbers in a set (with count). With that we can iterate through the 
> traverserSet and use the corresponding traversers if we have it sampled.
>  
> For the first issue, we can probably just avoid putting them in the 
> TraverserSet and use an ArrayList instead. Then we could also directly sample 
> from this ArrayList, which may even make this bulk handling superfluous?
>  
> There is however also an "efficient" rewrite of this query using fold, local 
> sample and unfold:
> {code:java}
> g.V().hasLabel('Course').fold().sample(Scope.local,1000).unfold().in('teacherOf').path()
>  {code}
>  
> The general question is, however, whether the {color:#1d1c1d}TraverserSet can 
> lead to hash collisions in other places. Maybe it makes sense to reevaluate 
> the usage of this TraverserSet{color} and, if useful/required, switch to 
> something more efficient time-wise (which could then require a bit more 
> memory).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (TINKERPOP-2899) SampleGlobalStep samples inefficiently with TraverserSet running into hash collisions

Reply via email to