This is an automated email from the ASF dual-hosted git repository. dkuppitz pushed a commit to branch TINKERPOP-2114 in repository https://gitbox.apache.org/repos/asf/tinkerpop.git
commit a9bf2205fd0100d22fce49120e3f5859d5b8bca7 Author: Daniel Kuppitz <[email protected]> AuthorDate: Mon Dec 10 08:17:50 2018 -0700 TINKERPOP-2114 Documented common Gremlin anti-patterns (and solutions) --- docs/src/recipes/anti-patterns.asciidoc | 290 ++++++++++++++++++++++++++++++++ docs/src/recipes/index.asciidoc | 4 + 2 files changed, 294 insertions(+) diff --git a/docs/src/recipes/anti-patterns.asciidoc b/docs/src/recipes/anti-patterns.asciidoc new file mode 100644 index 0000000..8026d08 --- /dev/null +++ b/docs/src/recipes/anti-patterns.asciidoc @@ -0,0 +1,290 @@ +//// +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to You under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +//// + +[[long-traversals]] +== Long Traversals + +It can be tempting to generate long traversals, e.g. to create a set of vertices and edges based on information that +resides within an application. For example, let's consider two lists - one that contains information about persons and +another that contains information about the relationship between these persons. To illustrate the problem we will +create two list with a few random map entries. + +[gremlin-groovy] +---- +:set max-iteration 10 +rnd = new Random(123) ; x = [] +persons = (1..100).collect {["id": it, "name": "person ${it}", "age": rnd.nextInt(40) + 20]} +relations = (1..500).collect {[rnd.nextInt(persons.size()), rnd.nextInt(persons.size())]}. + unique().grep {it[0] != it[1] && !x.contains(it.reverse())}.collect { + x << it + minAge = Math.min(persons[it[0]].age, persons[it[1]].age) + knowsSince = new Date().year + 1900 - rnd.nextInt(minAge) + ["from": persons[it[0]].id, "to": persons[it[1]].id, "since": knowsSince] + } +[ "Number of persons": persons.size() +, "Number of unique relationships": relations.size() ] +---- + +Now, to create the `person` vertices and the `knows` edges between them it may look like a good idea to generate a +single graph-mutating traversal, just like this: + +[gremlin-groovy] +---- +t = g +for (person in persons) { + t = t.addV("person"). + property(id, person.id). + property("name", person.name). + property("age", person.age).as("p${person.id}") +} ; [] +for (relation in relations) { + t = t.addE("knows").property("since", relation.since). + from("p${relation.from}"). + to("p${relation.to}") +} ; [] +traversalAsString = org.apache.tinkerpop.gremlin.groovy.jsr223.GroovyTranslator.of("g").translate(t.bytecode) ; [] +[ "Traversal String Length": traversalAsString.length() +, "Traversal Preview": traversalAsString.replaceFirst(/^(.{104}).*(.{64})$/, '$1 ... $2') ] +---- + +However, this kind of traversal does not scale and it's prone to produce a `StackOverflowError`. This error can hardly be prevented +as it's a limit imposed by the JVM. The stack size can be increased using the `-Xss` JVM option, but that's not how the problem that's +discussed here, should be solved. The proper way to accomplish the same thing as in the traversal above is to inject the lists into +the traversal and process them from there. + +[gremlin-groovy] +---- +g.withSideEffect("relations", relations). + inject(persons).sideEffect( + unfold(). + addV("person"). + property(id, select("id")). + property("name", select("name")). + property("age", select("age")). + group("m"). + by(id). + by(unfold())). + select("relations").unfold().as("r"). + addE("knows"). + from(select("m").select(select("r").select("from"))). + to(select("m").select(select("r").select("to"))). + property("since", select("since")).iterate() +g +---- + +Obviously, these traversals are more complicated, but the number of steps is known and thus it's the best way to prevent an +unexpected `StackOverflowError`. Furthermore, shorter traversals reduce the (de)serialization costs when such a traversal is send +over the wire to a Gremlin Server. + +NOTE: Although the example was based on a graph-mutating traversal, the same rules apply for read-only and mixed traversals. + +[[unspecified-keys-and-labels]] +== Unspecified Keys and Labels + +Some Gremlin steps have optional arguments that represent keys (e.g. `valueMap()`) or labels (e.g. `out()`). In the prototyping +phase of a projects it's often convenient to use these steps without any arguments. However, in production code this is bad idea +and keys and labels should always be specified. Not only does it make the traversal easier to read for others, but it also ensures +that the application will not break if the schema changes at one point and the queries return completely different results. + +The following code block shows a few examples that are good for prototyping or graph discovery. + +[gremlin-groovy,modern] +---- +g.V().has("person","name","marko").out() +g.V().has("person","name","marko").out("created").valueMap() +g.V().has("software","name","ripple").inE().has("weight", gte(0.5)).outV().properties() +---- + +The next code block shows the same queries, but with specified keys and labels. + +[gremlin-groovy,existing] +---- +g.V().has("person","name","marko").out("created","knows") +g.V().has("person","name","marko").out("created").valueMap("name","lang") +g.V().has("software","name","ripple").inE("created").has("weight", gte(0.5)).outV(). + properties("name","age") +---- + +[[unnecessary-steps]] +== Unnecessary Steps + +There are quite a few steps and patterns that can be combined into a much shorter form. TinkerPop is trying to optimize queries, by +rewriting such patterns automatically using traversal optimization strategies. These strategies, however, do have a few preconditions +and under certain circumstance they will not attempt to rewrite a traversal. For example, if the traversal has path computations +enabled (e.g. by using certain steps, such as `path()`, `simplePath()`, `otherV()`, etc.), then the assumption is that all steps are +required in order to produce the desired path. + +An often seen anti-pattern is the one that explicitely traversers to an edge and then to a vertex without using any filters. + +[gremlin-groovy,modern] +---- +g.V().hasLabel("person").outE("created").inV().dedup() <1> +g.V().hasLabel("software").inE("created").outV().count() <2> +---- + +<1> The `created` edge is never really needed as the traversal only asks for all things that were created by all persons in the graph. + These "things" are only represented by the adjacent vertices, not the edges. +<2> This traversals counts the persons in the graph who created software. The interesting thing about this query is that it actually + doesn't need to traverse all the way to the `person` vertices to count them. In this case it's sufficient to count the edges + between the `software` and `person` vertices. The performance of this query pretty much depends on the particular provider + implementation, but counting incident edges is usually much faster than counting adjacent vertices. + +The next code block shows the two aforementioned queries properly rewritten. + +[gremlin-groovy,modern] +---- +g.V().hasLabel("person").out("created").dedup() +g.V().hasLabel("software").inE("created").count() +---- + +Another anti-pattern that is commonly seen is the chaining of `where()`-steps using predicates. Consider the following traversal: + +[gremlin-groovy,modern] +---- +g.V().as('a'). + both().where(lt('a')).by(id).as('b'). + both().where(lt('a')).by(id).where(gt('b')).by(id).as('c'). + not(both().where(eq('a'))). + select('a','b','c'). + by('name') +---- + +Ignoring the anti-patterns that were discussed before there's not much wrong with the traversal, but note the two chained `where()`-steps +(`where(lt('a')).by(id).where(gt('b'))).by(id)`). Both steps compare the id of the current vertex with the id of a previous vertex. These +two conditions can be combined on the predicate level. + +[gremlin-groovy,existing] +---- +g.V().as('a'). + both().where(lt('a')).by(id).as('b'). + both().where(lt('a').and(gt('b'))).by(id).as('c'). + not(both().where(eq('a'))). + select('a','b','c'). + by('name') +---- + +The `profile()` output of both queries should make clear why this is better than using two `where()`-steps. + +[gremlin-groovy,existing] +---- +g.V().as('a'). + both().where(lt('a')).by(id).as('b'). + both().where(lt('a')).by(id).where(gt('b')).by(id).as('c'). + not(both().where(eq('a'))). + select('a','b','c'). + by('name'). + profile() +g.V().as('a'). + both().where(lt('a')).by(id).as('b'). + both().where(lt('a').and(gt('b'))).by(id).as('c'). + not(both().where(eq('a'))). + select('a','b','c'). + by('name'). + profile() +---- + +[[unspecified-label-in-global-vertex-lookup]] +== Unspecified Label in Global Vertex lookup + +The severity of the anti-pattern described in this section heavily depends on the provider implementation. Throughout the TinkerPop +documentation the code samples often use traversals that start like this: + +[gremlin-groovy,modern] +---- +g.V().has('name','marko') +---- + +This is totally fine for TinkerGraph as it uses a very simplified indexing schema, e.g. every vertex that has a certain property is stored in +the same index. However, providers may prefer to use separate indexes for different vertex labels. This becomes more important as graphs grow +much larger over time (which is not what TinkerGraph is meant to do). Hence, any traversal that's going to be used in production code should +also specify the vertex label to prevent the query engine from searching every index for the provided property value. + +The easy fix for the initially mentioned query follows in the code block below. + +[gremlin-groovy,existing] +---- +g.V().hasLabel('person').has('name','marko') <1> +g.V().has('person','name','marko') <2> +---- + +<1> With the specified label the traversal still returns the same result, but it's much safer to use across different providers. +<2> Same as statement 1, but a much shorter form to improve readability. + +[[steps-instead-of-tokens]] +== Steps Instead of Tokens + +When child traversals contain a single step, there's a good chance that the step can be replaced with a token. These tokens are translated +into optimized traversals that execute much faster then their step traversal pendants. A few examples of single step child traversals are +shown in the following code block. + +[gremlin-groovy,modern] +---- +g.V().groupCount().by(label()) +g.V().group().by(label()).by(id().fold()) +g.V().project("id","label"). + by(id()). + by(label()) +g.V().choose(label()). + option("person", project("person").by(values("name"))). + option("software", project("product").by(values("name"))) +---- + +With tokens used instead of steps the traversals become a little shorter and more readable. + +[gremlin-groovy,existing] +---- +g.V().groupCount().by(label) +g.V().group().by(label).by(id) <1> +g.V().project("id","label"). + by(id). + by(label) +g.V().choose(label). + option("person", project("person").by("name")). + option("software", project("product").by("name")) <2> +---- + +<1> Note, that tokens use a `fold()` reducer by default. +<2> `by("name")` doesn't use a token, but falls into the same category as the String `"name"` is translated into an optimized traversal. + +But this is not all about readability; as initially mentioned, the use of tokens also improves the overall query performance as shown in +the `profile()` output below. + +[gremlin-groovy,existing] +---- +g.V().groupCount().by(label()).profile() // not using token +g.V().groupCount().by(label).profile() // using a token +g.V().group().by(label()).by(id().fold()).profile() // not using tokens +g.V().group().by(label).by(id).profile() // using tokens +g.V().project("id","label"). + by(id()). + by(label()).profile() // not using tokens +g.V().project("id","label"). + by(id). + by(label).profile() // using tokens +g.V().choose(label()). + option("person", project("person").by(values("name"))). + option("software", project("product").by(values("name"))). + profile() // not using tokens +g.V().choose(label). + option("person", project("person").by("name")). + option("software", project("product").by("name")). + profile() // using tokens +---- + +NOTE: Pay attention to all metrics in `profile()`'s output. The time difference does not always look significant, but that's +usually because TinkerGraph keeps everything in memory and thus even bad queries can sometimes perform surprisingly well. The more +important metric is the number of traversers that are used and the overall number of generated steps. diff --git a/docs/src/recipes/index.asciidoc b/docs/src/recipes/index.asciidoc index b74321e..56b3231 100644 --- a/docs/src/recipes/index.asciidoc +++ b/docs/src/recipes/index.asciidoc @@ -66,6 +66,10 @@ include::tree.asciidoc[] include::olap-spark-yarn.asciidoc[] += Anti-Patterns + +include::anti-patterns.asciidoc[] + = Implementation Recipes include::style-guide.asciidoc[]
