[
https://issues.apache.org/jira/browse/TINKERPOP-962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133224#comment-15133224
]
ASF GitHub Bot commented on TINKERPOP-962:
------------------------------------------
GitHub user okram opened a pull request:
https://github.com/apache/incubator-tinkerpop/pull/210
TINKERPOP-962: Provide "vertex query" selectivity when importing data in
OLAP.
https://issues.apache.org/jira/browse/TINKERPOP-962
(For TinkerPop 3.2.0 -- Breaking Change for GraphComputer Implementations)
This feature enables us to push down a `GraphFilter` predicate to the
underlying OLAP graph system. For instance, if `g.V().count()` is executed by
`SparkGraphComputer`, there is no reason to load any edges; we can simply
push down a `GraphFilter` predicate that filters out the edges. Graph database
providers like Titan can send up only the subset of the graph that
is required for the OLAP job instead of filtering on the OLAP cluster machines.
In the future, we will provide `GraphFilterTraversalStrategy`, which will
analyze the traversal and automatically generate a `GraphFilter` so that the user
does not need to know which subsets of the full graph are actually being accessed
by the OLAP engine.
This pull request yields a breaking change for graph system providers that
have their own `GraphComputer` implementation. There are two new methods on
`GraphComputer` and one new method on `GraphReader`.
```
GraphComputer vertices(Traversal<Vertex,Vertex> vertexFilter)
GraphComputer edges(Traversal<Vertex,Edge> edgeFilter)
GraphReader.readVertex(InputStream inputStream, GraphFilter graphFilter)
```
TinkerPop provides a `GraphFilter` object that does a lot of the heavy
lifting, so at minimum the graph system provider simply needs to check the
vertices and edges it loads with `GraphFilter.isLegal()`. Note that if the graph
system provider relies on `GiraphGraphComputer` or `SparkGraphComputer`, then
no change is required on their part unless they want to leverage the `GraphFilter`
locally before sending their data to Giraph or Spark (an optimization that can
be done at a later date without impacting users).
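As a rough illustration of what this legality check can look like on the provider side, here is a minimal, self-contained sketch in plain Java. It deliberately does not use the TinkerPop API: `GraphFilterSketch`, `StarVertexSketch`, `EdgeSketch`, and `applyFilter` are hypothetical stand-ins, and the filters are plain `Predicate`s rather than traversals. The vertex filter is tested first; only if the vertex passes are its incident edges pruned by the edge filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical stand-ins for illustration only; none of these are TinkerPop classes.
final class GraphFilterSketch {

    /** A simplified incident edge: just a label and a weight. */
    static final class EdgeSketch {
        final String label;
        final double weight;

        EdgeSketch(String label, double weight) {
            this.label = label;
            this.weight = weight;
        }
    }

    /** A simplified "star": one vertex together with its incident edges. */
    static final class StarVertexSketch {
        final String label;
        final List<EdgeSketch> edges;

        StarVertexSketch(String label, List<EdgeSketch> edges) {
            this.label = label;
            this.edges = edges;
        }
    }

    /**
     * Returns Optional.empty() when the vertex itself is filtered out; otherwise
     * returns a copy of the vertex that keeps only the edges accepted by edgeFilter.
     */
    static Optional<StarVertexSketch> applyFilter(StarVertexSketch vertex,
                                                  Predicate<StarVertexSketch> vertexFilter,
                                                  Predicate<EdgeSketch> edgeFilter) {
        if (!vertexFilter.test(vertex)) {
            return Optional.empty(); // the edge filter is never consulted
        }
        List<EdgeSketch> kept = new ArrayList<>();
        for (EdgeSketch edge : vertex.edges) {
            if (edgeFilter.test(edge)) {
                kept.add(edge);
            }
        }
        return Optional.of(new StarVertexSketch(vertex.label, kept));
    }

    public static void main(String[] args) {
        StarVertexSketch person = new StarVertexSketch("person", Arrays.asList(
                new EdgeSketch("knows", 0.9),
                new EdgeSketch("created", 0.4)));
        Optional<StarVertexSketch> loaded = applyFilter(person,
                v -> v.label.equals("person"),
                e -> e.label.equals("knows"));
        // One "knows" edge survives, so this prints 1.
        System.out.println(loaded.map(v -> v.edges.size()).orElse(-1));
    }
}
```

Returning an `Optional` rather than `null` for a filtered-out vertex is one possible design choice in this sketch; it keeps the "vertex rejected" case explicit for the caller.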
A host of changes were made to create this feature.
When merged, the `CHANGELOG.txt` will have the following new items:
```
* Added `GraphFilter` to support filtering out vertices and edges that
won't be touched by an OLAP job.
* Added `GraphComputer.vertices()` and `GraphComputer.edges()` for
`GraphFilter` construction (*breaking*).
* `SparkGraphComputer`, `GiraphGraphComputer`, and `TinkerGraphComputer`
all support `GraphFilter`.
* Added `GraphComputerTest.shouldSupportGraphFilter()` which verifies all
filtered graphs have the same topology.
* Added `GraphFilterAware` interface to `hadoop-gremlin/` which tells the
OLAP engine that the `InputFormat` handles filtering.
* `GryoInputFormat` and `ScriptInputFormat` all implement
`GraphFilterAware`.
* Fixed a bug in `TraversalUtil.isLocalStarGraph()` which allowed certain
illegal traversals to pass.
* Added `TraversalUtil.isLocalVertex()` to verify that the traversal does
not touch incident edges.
* `GraphReader` IO interface now has `Optional<Vertex>
readVertex(InputStream, GraphFilter)`. Default `UnsupportedOperationException`.
* `GryoReader` does not materialize edges that will be filtered out and
this greatly reduces GC and load times.
* Created custom `Serializers` for `SparkGraphComputer` message-passing
classes which reduce graph sizes significantly.
```
Ran `mvn clean install` and integration tests. Passed.
VOTE +1.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP-962
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-tinkerpop/pull/210.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #210
----
commit 873174e8218aef31f2220928ab16463aeda650cd
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-01T16:29:14Z
Started working on GraphComputer.vertices() and GraphComputer.edges(). Have
it working (untested) for SparkGraphComputer. The same pattern will flow over
to GiraphGraphComputer. There are some issues regarding semantics in
TinkerGraphComputer. Will bring up with a [DISCUSS].
commit 3b3e008ce03d1f63610b92ff79886376d9dc55f7
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-01T19:38:42Z
GraphComputerTest now verifies that graph filters work --
GraphComputer.vertices() and GraphComputer.edges(). SparkGraphComputer
implements graph filters correctly. TinkerGraph and Giraph throw
UnsupportedOperationException at this point (i.e. TODO). Had to add remove()
methods to many of the inner Iterator anonymous classes in IteratorUtils and
MultiIterator. Basically, they just call remove() on the wrapped iterator.
Thus, cleanly backwards compatible. The added GraphFilterAware interface will allow
InputFormats to say whether or not they do vertex/edge-filtering on graph load.
Nothing connected to that yet, but GryoInputFormat (and smart providers) will
be able to leverage this interface. Still a work in progress....
commit 3485d8454855938fd7c0c24d5c3f9c3eb6ab308a
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-01T21:45:48Z
Created a CommonFileInputFormat abstract class that both GryoInputFormat
and ScriptInputFormat now extend. It handles all vertex/edge filter
construction and has helper methods for filtering the StarVertex prior to being
fully loaded by the InputFormat. This is really nice as we can now tweak vertex
loading to a pretty intense degree especially with GryoInputFormat (e.g. once
properties are loaded, check vertex filter and thus, don't even deserialize the
edges). As it stands right now, the full Vertex is materialized, then validated
before the InputFormat will nextKeyValue().
commit 77732ddd5f60bbd65a445390e590da34bea1db2f
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-01T21:59:04Z
tweaks to filtered boolean check.
commit 64c684065143b75697ccac755b9dfbf943c8c54c
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-01T23:46:07Z
GiraphGraphComputer now has support for vertexFilters and edgeFilters.
Consolidated a bunch of code to make it easy for future InputFormats to be
GraphFilterAware. Will most likely make a filterMap so variables are bundled
nicely.
commit bc417dbf01fee817aa325ee8e4b582fef8ab6788
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-02T00:44:19Z
created a GraphFilter container object that makes storing and applying
filters easy. Very clean model. GraphFilter will next contain stuff like
inferences on the filters so easy push-down predicates are available to the
graph system provider.
commit d0ac65277702b703c1ab2257adcbf67b0699b959
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-02T15:55:14Z
GraphFilter is now a really cool class. It is part of gremlin-core/computer
and provides access to GraphComputer vertices() and edges() load filters. It
also provides direct support for filtering StarVertex vertices (as most OLAP
systems will leverage StarVertex). Its StarVertex support is nice in that
GraphFilter analyzes the edgesFilter and can do bulk dropEdges() to prune the
StarVertex fast. Whatever it can't do in bulk, it then runs the edgeFilter over
the remaining edges. GraphComputerTest.shouldSupportGraphFilter() ensures that
the graph is properly pruned. I have some ideas about pushing GraphFilter down
to the StarVertex deserializer, but will need @spmallette's help on that. If we
can do that, then we can get some BLAZING speeds for highly pruned OLAP
operations.
commit eee16c9354602a49e7dfb7738f2ce4d9fe36152c
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-02T19:17:59Z
TinkerGraph now supports GraphComputer GraphFilter. Sort of an elegant
solution that makes use of tagging elements that are legal or not. As of right
now, the full test suite passes (integration too). GraphFilter works -- this is
going to be huge for speeding up OLAP times.
commit 7ad48f20586ec58b1fea7018fa8f37ec8c95c9b9
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-02T19:44:57Z
added a MapReduce test. We now verify that GraphFilter works for both
VertexProgram+MapReduce and MapReduce only. TinkerGraph and Spark integration
tests pass.
commit 72e388c4a1eadb6654a422988857006ed27b6158
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-02T19:53:54Z
added nice GraphFilter.legalVertex() and GraphFilter.legalEdges() methods
so that the provider doesn't have to be smart about how to apply the underlying
filter traversal.
commit e4cf925b496ee250f7dca48d094e8b93816ca075
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-02T20:24:45Z
Added a state-based test case to GraphFilter. About to run this thing on
the Blade cluster against Friendster to see how well we do now.
commit 7023987a0b5154646a4d77e0b2b3506e850ed3d2
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-02T20:35:27Z
Forgot to add vertices() and edges() to the
ComputerTraversalEngine.Builder. I can't wait for this model to go away in
favor of a fluent TraversalSource.
commit 6cfb1f22f43fa82be10d04fc28e86e8f3db9d28e
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-02T21:24:38Z
found a bug in TraversalUtil.isLocalStarGraph(). Added
TraversalUtil.isLocalVertex() (for only checking properties -- no edge access).
Added JavaDoc to new GraphComputer methods. Added verification that the provided
traversals don't leave their respective boundaries.
commit f7ad5c4f6a7b197cebb86fa22d4c263ce6b3365b
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-02T21:49:22Z
Added standard GraphComputer.Exceptions for GraphFilter and verify
Exceptions are thrown correctly in GraphComputerTest. Tweaks to JavaDoc.
commit b824d0c0994276e3714dc59341aa24526127eafe
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-03T18:03:23Z
Created specialized serializers for common classes in Spark to avoid the
overhead of JavaSerialization.
commit 4afe29a80fb15f965924297fafda942adeb36b06
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-03T18:09:46Z
forgot a Serialization that popped up when taking things to the cluster.
commit 1c9a31c4c3d3f09c829d135363ad7ebff6590c8d
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-03T19:49:53Z
Learned about ExternalizableSerializer which makes registration of Kryo
serializers a lot simpler. Ran this code on the cluster -- what took 25
minutes now takes 6.8 minutes.
commit 097e09a39a151e6dbb8ebb268bc1792baac8765a
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-03T20:48:31Z
minor nothings.
commit 001a13dec5d3bb7ffa269fa2e392947d5c600a5e
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-03T21:28:59Z
Merge branch 'master' into TINKERPOP-962
commit 569496f671f4e532fc459cee54da3e6e62522ac1
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-04T14:36:31Z
Merge branch 'master' into TINKERPOP-962
commit 07f7a8c614493de4bd13d2e75292609c5ee7183c
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-04T17:42:24Z
Moved GraphFilterTest to gremlin-groovy/ so I can use reflection and not
have to make internal variables protected for testing purposes.
Optional<Vertex> GraphReader.readVertex(InputStream,GraphFilter) now exists at
the interface level with an UnsupportedOperationException default. GryoReader
can now read vertices from a GraphFilter-perspective and only materialize those
vertices/edges that are legal. Should be fairly trivial to add to
GraphSONReader.
commit b3d3116e5f287e61d44993af3e709c7d04bf77ac
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-04T20:00:00Z
was using null to represent a filtered vertex. went with Optional
throughout so the API is consistent.
commit a28b1fdc673bb6a11b741d306e3706efc4510592
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-04T20:23:12Z
method rename. pointless twiddling.
commit 25e5b24049ef22d1bb64ae652d6ff5cba4786451
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-04T20:51:12Z
ensure that the context is closed after the test suite has completed.
commit ed18cd9382ee1e2db7f4618a72e9d28ed6b2fb2a
Author: Marko A. Rodriguez <[email protected]>
Date: 2016-02-04T22:51:08Z
OMG, the most insane bug for the last two hours. Painful......
----
> Provide "vertex query" selectivity when importing data in OLAP.
> ---------------------------------------------------------------
>
> Key: TINKERPOP-962
> URL: https://issues.apache.org/jira/browse/TINKERPOP-962
> Project: TinkerPop
> Issue Type: Improvement
> Components: process
> Affects Versions: 3.1.0-incubating
> Reporter: Marko A. Rodriguez
> Assignee: Marko A. Rodriguez
> Labels: breaking
> Fix For: 3.2.0-incubating
>
>
> Currently, when you do:
> {code}
> graph.compute().program(PageRankVertexProgram).submit()
> {code}
> We are pulling the entire {{graph}} into the OLAP engine. We should allow the
> user to limit the amount of data pulled via "vertex query"-type filter. For
> instance, we could support the following two new methods on {{GraphComputer}}.
> {code}
> graph.compute().program(PageRankVertexProgram).vertices(hasLabel('person')).edges(out,
> hasLabel('knows','friend').has('weight',gt(0.8))).submit()
> {code}
> The two methods would be defined as:
> {code}
> public interface GraphComputer {
> ...
> GraphComputer vertices(final Traversal<Vertex,Vertex> vertexFilter);
> GraphComputer edges(final Direction direction, final Traversal<Edge,Edge> edgeFilter);
> }
> {code}
> If the user does NOT provide a {{vertices()}} (or {{edges()}}) call, then the
> {{Traversal}} is assumed to be {{IdentityTraversal}}. Finally, in terms of
> execution order, the {{vertices()}} filter is evaluated first, and if it yields
> "false" the edge filter is not applied. Otherwise, the edge filter is applied to
> all of the vertex's incoming and outgoing edges. I don't really like
> {{Direction}} there, so perhaps it's just:
> {code}
> GraphComputer edges(final Traversal<Vertex,Edge> edgeFilter)
> {code}
> And then all edges that pass through are added to the OLAP vertex. You don't want
> {{both}}? Then it's {{outE('knows','friend').has('weight',gt(0.8))}}.
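The semantics proposed in the issue description above (identity defaults, vertex filter before edge filter, direction folded into the edge filter itself) can be sketched roughly as follows. This is a plain-Java model under stated assumptions, not the TinkerPop API: `LoadFilterSketch`, its nested `Vertex` and `Edge` classes, `load`, and the `outE` helper are all hypothetical, and a `Function<Vertex, List<Edge>>` stands in for a `Traversal<Vertex,Edge>`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical model for illustration only; it does not use the TinkerPop API.
final class LoadFilterSketch {

    static final class Edge {
        final String label;
        final boolean outgoing; // true for an outgoing edge, false for an incoming one

        Edge(String label, boolean outgoing) {
            this.label = label;
            this.outgoing = outgoing;
        }
    }

    static final class Vertex {
        final String label;
        final List<Edge> edges;

        Vertex(String label, List<Edge> edges) {
            this.label = label;
            this.edges = edges;
        }
    }

    // Identity defaults: with no vertices()/edges() call, everything is loaded.
    private Predicate<Vertex> vertexFilter = v -> true;
    private Function<Vertex, List<Edge>> edgeFilter = v -> v.edges;

    LoadFilterSketch vertices(Predicate<Vertex> filter) {
        this.vertexFilter = filter;
        return this;
    }

    // The filter maps a vertex to the edges to keep, so direction lives inside it.
    LoadFilterSketch edges(Function<Vertex, List<Edge>> filter) {
        this.edgeFilter = filter;
        return this;
    }

    /** Rough analogue of outE(labels...): outgoing edges with one of the given labels. */
    static Function<Vertex, List<Edge>> outE(String... labels) {
        List<String> wanted = Arrays.asList(labels);
        return v -> {
            List<Edge> out = new ArrayList<>();
            for (Edge e : v.edges) {
                if (e.outgoing && wanted.contains(e.label)) {
                    out.add(e);
                }
            }
            return out;
        };
    }

    List<Vertex> load(List<Vertex> source) {
        List<Vertex> loaded = new ArrayList<>();
        for (Vertex v : source) {
            if (!vertexFilter.test(v)) {
                continue; // vertex rejected: the edge filter is never applied
            }
            loaded.add(new Vertex(v.label, new ArrayList<>(edgeFilter.apply(v))));
        }
        return loaded;
    }

    public static void main(String[] args) {
        List<Vertex> source = Arrays.asList(
                new Vertex("person", Arrays.asList(
                        new Edge("knows", true), new Edge("knows", false), new Edge("created", true))),
                new Vertex("software", Arrays.asList(new Edge("created", false))));
        List<Vertex> all = new LoadFilterSketch().load(source);
        List<Vertex> pruned = new LoadFilterSketch()
                .vertices(v -> v.label.equals("person"))
                .edges(outE("knows"))
                .load(source);
        // Prints "2 1 1": two vertices unfiltered; one vertex with one edge when filtered.
        System.out.println(all.size() + " " + pruned.size() + " " + pruned.get(0).edges.size());
    }
}
```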
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)