[ https://issues.apache.org/jira/browse/TINKERPOP-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804224#comment-15804224 ]
ASF GitHub Bot commented on TINKERPOP-1585:
-------------------------------------------

Github user dkuppitz commented on a diff in the pull request:

    https://github.com/apache/tinkerpop/pull/524#discussion_r94925652

    --- Diff: gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/filter/DedupGlobalStep.java ---
    @@ -89,6 +92,17 @@ public ElementRequirement getMaxRequirement() {
         @Override
         protected Traverser.Admin<S> processNextStart() {
    +        if (null != this.barrier) {
    +            this.barrierIterator = this.barrier.entrySet().iterator();
    +            this.barrier = null;
    +        }
    +        while (this.barrierIterator != null && this.barrierIterator.hasNext()) {
    +            if (null == this.barrierIterator)
    --- End diff --

    `this.barrierIterator` can never be null within the `while()` loop. Unless I overlooked something fundamental, `processNextStart` can be simplified to:

    ```
    protected Traverser.Admin<S> processNextStart() {
        if (null != this.barrier) {
            this.barrierIterator = this.barrier.entrySet().iterator();
            this.barrier = null;
            while (this.barrierIterator.hasNext()) {
                final Map.Entry<Object, Traverser.Admin<S>> entry = this.barrierIterator.next();
                if (this.duplicateSet.add(entry.getKey()))
                    return PathProcessor.processTraverserPathLabels(entry.getValue(), this.keepLabels);
            }
        }
        return PathProcessor.processTraverserPathLabels(super.processNextStart(), this.keepLabels);
    }
    ```

> OLAP dedup over non elements
> ----------------------------
>
>                 Key: TINKERPOP-1585
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-1585
>             Project: TinkerPop
>          Issue Type: Bug
>          Components: hadoop, process
>    Affects Versions: 3.2.3
>            Reporter: Daniel Kuppitz
>            Assignee: Marko A. Rodriguez
>
> OLAP {{dedup()}} is highly inefficient when it's fed with non-elements.
> In a customer project, a query similar to the following returned a result in
> slightly more than 6 seconds:
> {noformat}
> persistedRDD.
>   V().hasLabel("label1","label2").
>   inE("edgeLabel1","edgeLabel2").outV().
>   id().count()
> {noformat}
> The same query with {{dedup()}} added:
> {noformat}
> persistedRDD.
>   V().hasLabel("label1","label2").
>   inE("edgeLabel1","edgeLabel2").outV().
>   id().dedup().count()
> {noformat}
> ...took more than 120 seconds.
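
To make the control flow discussed in the review comment concrete, here is a minimal, self-contained sketch of draining a barrier map through an iterator while a set filters out keys that were already seen. It is not TinkerPop code: the class `BarrierDedup`, its `addBarrier` and `next` methods, and the use of `String` values in place of traversers are all assumptions made for illustration. Only the field names `barrier`, `barrierIterator`, and `duplicateSet` mirror the diff.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the step under review; not TinkerPop code.
final class BarrierDedup {
    // Pending entries, keyed by the value being deduplicated on.
    private Map<Object, String> barrier;
    // Kept as a field so that later calls can resume where the last one stopped.
    private Iterator<Map.Entry<Object, String>> barrierIterator;
    // Keys that have already been emitted.
    private final Set<Object> duplicateSet = new HashSet<>();

    void addBarrier(final Map<Object, String> entries) {
        this.barrier = entries;
    }

    // Returns the next value whose key has not been seen, or null when drained.
    String next() {
        if (null != this.barrier) {
            this.barrierIterator = this.barrier.entrySet().iterator();
            this.barrier = null;
        }
        // The iterator is null only if no barrier was ever added.
        while (null != this.barrierIterator && this.barrierIterator.hasNext()) {
            final Map.Entry<Object, String> entry = this.barrierIterator.next();
            // Set.add returns false for a key that is already present.
            if (this.duplicateSet.add(entry.getKey()))
                return entry.getValue();
        }
        return null;
    }

    public static void main(final String[] args) {
        final BarrierDedup dedup = new BarrierDedup();
        final Map<Object, String> first = new LinkedHashMap<>();
        first.put(1, "a");
        first.put(2, "b");
        dedup.addBarrier(first);
        for (String v = dedup.next(); v != null; v = dedup.next())
            System.out.println(v);
    }
}
```

In this sketch the drain loop sits outside the `null != this.barrier` branch, so a call made after the barrier field has been cleared still continues with the remaining entries of the stored iterator.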