[ https://issues.apache.org/jira/browse/TINKERPOP-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15767603#comment-15767603 ]
Daniel Kuppitz commented on TINKERPOP-1585:
-------------------------------------------

Quick performance test / comparison over TinkerGraph:

{noformat}
graph = TinkerGraph.open()
g = graph.traversal()
a = graph.traversal().withComputer()
r = new Random(123)
(1..1000000).each {
  def vid = ["a","b","c","d"].collectEntries {[it, r.nextInt() % 400000]}
  graph.addVertex(id, vid)
}; []

clockWithResult(1) {g.V().id().select("c").count().next()}
clockWithResult(1) {g.V().id().select("c").dedup().count().next()}
clockWithResult(1) {a.V().id().select("c").count().next()}
clockWithResult(1) {a.V().id().select("c").dedup().count().next()}
{noformat}

{noformat}
gremlin> clockWithResult(1) {g.V().id().select("c").count().next()}
==>22.258808
==>1000000
gremlin> clockWithResult(1) {g.V().id().select("c").dedup().count().next()}
==>727.913942
==>570723
gremlin> clockWithResult(1) {a.V().id().select("c").count().next()}
==>23448.141182
==>1000000
gremlin> clockWithResult(1) {a.V().id().select("c").dedup().count().next()}
==>31519.832272
==>570723
{noformat}

Spark is a lot faster with the no-dedup traversal, but it probably takes advantage of multiple parallel threads. On my local machine, Spark (w/o TinkerPop) needs about 600 ms to count the number of Strings in a list of 1M items and approx. 5 seconds to count the number of distinct items. Maybe a Spark interceptor is the way to go.

> OLAP dedup over non elements
> ----------------------------
>
>                 Key: TINKERPOP-1585
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-1585
>             Project: TinkerPop
>          Issue Type: Bug
>          Components: hadoop, process
>    Affects Versions: 3.2.3
>            Reporter: Daniel Kuppitz
>            Assignee: Marko A. Rodriguez
>
> OLAP {{dedup()}} is highly inefficient when it's fed with non elements.
> In a customer project a query similar to the following returned a result in
> slightly more than 6 seconds:
> {noformat}
> persistedRDD.
>   V().hasLabel("label1","label2").
>   inE("edgeLabel1","edgeLabel2").outV().
>   id().count()
> {noformat}
> The same query with {{dedup()}} added:
> {noformat}
> persistedRDD.
>   V().hasLabel("label1","label2").
>   inE("edgeLabel1","edgeLabel2").outV().
>   id().dedup().count()
> {noformat}
> ...took more than 120 seconds.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
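
For reference, a plain-JVM baseline for the "c" values used in the TinkerGraph test might look like the sketch below. It is not part of TinkerPop or Spark; the class and method names (DedupBaseline, sampleValues, countDistinct, expectedDistinct) are made up for illustration. It assumes Groovy's % on ints behaves like Java's remainder, so the values fall roughly uniformly in -399999..399999 (799999 possible values). Under that assumption about 570,800 distinct values are expected out of 1M draws, which is in line with the 570723 reported by the dedup() traversals.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

final class DedupBaseline {

    // Mirrors the data setup of the console script: four nextInt() draws per
    // vertex (keys a, b, c, d), keeping only the third one ("c").
    public static int[] sampleValues(int vertices, long seed) {
        Random r = new Random(seed);
        int[] c = new int[vertices];
        for (int i = 0; i < vertices; i++) {
            r.nextInt();                  // "a" (discarded)
            r.nextInt();                  // "b" (discarded)
            c[i] = r.nextInt() % 400000;  // "c", may be negative
            r.nextInt();                  // "d" (discarded)
        }
        return c;
    }

    // Single-threaded dedup + count using a hash set.
    public static long countDistinct(int[] values) {
        Set<Integer> seen = new HashSet<>();
        for (int v : values) {
            seen.add(v);
        }
        return seen.size();
    }

    // Expected number of distinct values after n uniform draws from m values:
    // m * (1 - (1 - 1/m)^n). Uniformity is an approximation here.
    public static double expectedDistinct(long n, long m) {
        return m * (1.0 - Math.pow(1.0 - 1.0 / m, n));
    }

    public static void main(String[] args) {
        int[] values = sampleValues(1_000_000, 123L);

        long start = System.nanoTime();
        long distinct = countDistinct(values);
        double millis = (System.nanoTime() - start) / 1e6;

        System.out.println("total:    " + values.length);
        System.out.println("distinct: " + distinct + " (" + millis + " ms)");
        System.out.println("expected: " + expectedDistinct(values.length, 799_999L));
    }
}
```

Timing this on the same machine as the traversals gives a rough lower bound for what a single-threaded dedup of 1M boxed integers costs, which helps separate the cost of the dedup itself from the overhead of the OLAP machinery.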