I'll try to keep this simple, as serialization tends to be anything but
simple...

Setting aside GraphML, which has its own rules, GraphSON and Gryo are the
two key serialization modules that we have in IO.  We use these both for
serialization to disk and for serialization over the network in Gremlin
Server. If you issue a request like:

g.V()

it returns vertices, obviously. For both Gryo and GraphSON, those vertices
are converted to DetachedVertex, which includes the properties of the
Vertex. This can be tremendously expensive, especially if the graph
supports multi-properties.

I think that Gremlin Server should take a hint from OLAP in relation to
this issue. With OLAP, a Vertex is converted to a ReferenceVertex where we
only get the element identifier passed around.

gremlin> graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
==>hadoopgraph[gryoinputformat->gryooutputformat]
gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
==>graphtraversalsource[hadoopgraph[gryoinputformat->gryooutputformat], sparkgraphcomputer]
gremlin> l = g.V().toList();[]
gremlin> l[0].class
==>class org.apache.tinkerpop.gremlin.structure.util.reference.ReferenceVertex

If you want more information, it is up to you to issue your query to
request that information - for example:

g.V().valueMap(true)

I think Gremlin Server should work in the same fashion (i.e. return a
ReferenceVertex when a Vertex is serialized over the network).  It would
ease up on serialization overhead and force users to be more explicit about
the data that they want, which would prevent unnecessary performance
surprises. This change might also be nice for the efficiency of
RemoteGraph/Connection implementations.
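
To make the difference in payload concrete, here is a rough, self-contained
Java sketch. It does not use TinkerPop at all: ToyVertex, detachedPayload and
referencePayload are hypothetical stand-ins invented for illustration, and the
string encoding is not the Gryo or GraphSON wire format. It only models the
idea that a "detached" form carries every property value while a "reference"
form carries just the identifier.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in model; none of these types are TinkerPop classes.
public class VertexPayloadSketch {

    /** A toy vertex: an id, a label, and multi-properties (key -> values). */
    static final class ToyVertex {
        final Object id;
        final String label;
        final Map<String, List<Object>> properties = new LinkedHashMap<>();

        ToyVertex(Object id, String label) {
            this.id = id;
            this.label = label;
        }

        /** Adds a value under the key; repeated keys accumulate values. */
        ToyVertex property(String key, Object value) {
            properties.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
            return this;
        }
    }

    /** "Detached"-style payload: id, label, and every property value. */
    static String detachedPayload(ToyVertex v) {
        StringBuilder sb = new StringBuilder();
        sb.append("v[").append(v.id).append("]:").append(v.label);
        for (Map.Entry<String, List<Object>> e : v.properties.entrySet()) {
            for (Object value : e.getValue()) {
                sb.append(';').append(e.getKey()).append('=').append(value);
            }
        }
        return sb.toString();
    }

    /** "Reference"-style payload: the element identifier only. */
    static String referencePayload(ToyVertex v) {
        return "v[" + v.id + "]";
    }

    public static void main(String[] args) {
        ToyVertex marko = new ToyVertex(1, "person").property("name", "marko");
        // A multi-property: many values under one key, all of them serialized
        // in the detached form.
        for (int year = 2000; year < 2016; year++) {
            marko.property("location", "city-" + year);
        }
        System.out.println("detached chars:  " + detachedPayload(marko).length());
        System.out.println("reference chars: " + referencePayload(marko).length());
    }
}
```

The detached payload grows with every property value added, while the
reference payload stays the same size no matter how many properties the
vertex has, which is the overhead argument above in miniature.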

This has bothered me for a while, but we carried over the pattern from
TinkerPop 2.x of sending back properties and I've been concerned about
introducing a break in trying to improve that.  I dug into it more today
and my analysis seems to indicate that this change can occur without
breaking all the code that's currently out there. I think that we could
keep the existing serialization model and simply add in the ReferenceVertex
approach as a configuration option for 3.2.1 and then make it the default
for 3.3.x.

If there are no objections in the next 72 hours (Saturday, May 21, 2016,
4pm EST) I'll assume lazy consensus and move forward.
