Re: [DISCUSS] Returning Side Effects

Stephen Mallette Mon, 08 Aug 2016 04:37:14 -0700

So, in retrospect, the unified streaming model for results and side-effects
wasn't so awesome and things needed to get reworked a bit. I changed it to
the other option we had, which was to cache the side-effects on the server
and return them on demand. The benefit is that we only return side-effect
data when it is actually requested:


gremlin> graph = RemoteGraph.open('conf/remote-graph.properties')
==>remotegraph[DriverServerConnection-localhost/127.0.0.1:8182 [graph=g]]
gremlin> g = graph.traversal()
==>graphtraversalsource[remotegraph[DriverServerConnection-localhost/127.0.0.1:8182 [graph=g]], standard]
gremlin> t = g.V(1).aggregate('a').outE("knows").aggregate("b").inV()
==>v[2]
==>v[4]
gremlin> se = t.getSideEffects();[]
gremlin> se.keys()          // request the keys from the server and cache locally for future calls
==>a
==>b
gremlin> se.get('a')          // get "a" side-effect from the server and cache locally for future calls against "a"
==>v[1]
gremlin> se.get('b')          // get "b" side-effect from the server and cache locally for future calls against "b"
==>e[7][1-knows->2]
==>e[8][1-knows->4]

The downside is that we have to hold the side-effects on the server in a
cache so there is some cost in doing that. I think that cost can be
mitigated though if a Traversal is treated like a resource that we close().
The close() can then trigger something to release the side-effects on the
server.
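
To make the "fetch on demand, cache locally, release on close()" idea a bit
more concrete, here is a rough sketch in Python rather than the actual Java.
The class name RemoteSideEffects and the three callables are made up for
illustration and simply stand in for round trips to the server; this is not
the code on the branch.

```python
class RemoteSideEffects:
    """Hypothetical client-side handle: lazy fetch, local cache, explicit close."""

    def __init__(self, fetch_keys, fetch_value, release):
        # Each callable stands in for one request to the server (assumption).
        self._fetch_keys = fetch_keys
        self._fetch_value = fetch_value
        self._release = release
        self._keys = None      # keys cached after the first keys() call
        self._values = {}      # side-effect values cached per key
        self._closed = False

    def keys(self):
        if self._keys is None:
            self._ensure_open()
            self._keys = list(self._fetch_keys())
        return self._keys

    def get(self, key):
        if key not in self._values:
            self._ensure_open()
            self._values[key] = self._fetch_value(key)
        return self._values[key]

    def close(self):
        # Tell the server it can drop the cached side-effects, exactly once.
        if not self._closed:
            self._release()
            self._closed = True

    def _ensure_open(self):
        if self._closed:
            raise RuntimeError("side-effects were already released on the server")

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()


# A fake "server" so the sketch runs on its own.
server_cache = {"a": ["v[1]"], "b": ["e[7][1-knows->2]", "e[8][1-knows->4]"]}
calls = []

def fetch_keys():
    calls.append("keys")
    return sorted(server_cache)

def fetch_value(key):
    calls.append("gather:" + key)
    return list(server_cache[key])

def release():
    calls.append("close")
    server_cache.clear()

with RemoteSideEffects(fetch_keys, fetch_value, release) as se:
    first = se.get("a")     # goes to the "server"
    second = se.get("a")    # served from the local cache
    key_list = se.keys()
```

Treating the handle as a context manager means the server-side release happens
even if the caller forgets to call close() explicitly.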

As far as the protocol goes, I didn't actually change a whole lot from what
I previously described, but there are now multiple ops on the
TraversalOpProcessor for drivers to implement:

+ bytecode - send Traversal bytecode here and get the results from the
Traversal
+ keys - get the side-effect keys of a previously executed Traversal,
streamed back in the same way we currently ship back traversal results
(side-effects are cached by the request id of the original traversal sent
to bytecode)
+ gather - get the side-effects for a key, streamed back in the same way we
currently ship traversal results, with the same extra meta-data described
in the previous thread (i.e. the sideEffect/aggregateTo keys/values)
+ close - release a particular set of side-effects from the cache
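
The four ops above could be sketched roughly like this. It is a toy in-memory
model in Python, not the TraversalOpProcessor itself: the class name
TraversalOps, the run_traversal callable (standing in for bytecode
evaluation), and the batch_size parameter are all assumptions for
illustration.

```python
import uuid


class TraversalOps:
    """Hypothetical server-side model of the bytecode/keys/gather/close ops."""

    def __init__(self, batch_size=64):
        self._batch_size = batch_size
        self._side_effects = {}   # request id -> {side-effect key: values}

    def _stream(self, items):
        # Every op streams its output back in batches, the same way.
        items = list(items)
        for start in range(0, len(items), self._batch_size):
            yield items[start:start + self._batch_size]

    def bytecode(self, request_id, run_traversal):
        # Run the traversal, keep its side-effects, stream only the results.
        results, side_effects = run_traversal()
        self._side_effects[request_id] = side_effects
        return self._stream(results)

    def keys(self, request_id):
        return self._stream(sorted(self._side_effects[request_id]))

    def gather(self, request_id, key):
        return self._stream(self._side_effects[request_id][key])

    def close(self, request_id):
        self._side_effects.pop(request_id, None)

    def is_cached(self, request_id):
        return request_id in self._side_effects


ops = TraversalOps(batch_size=1)
rid = str(uuid.uuid4())

def run_traversal():
    # Canned output mirroring the console session above.
    return ["v[2]", "v[4]"], {"a": ["v[1]"], "b": ["e[7]", "e[8]"]}

results = list(ops.bytecode(rid, run_traversal))
keys = list(ops.keys(rid))
b_values = list(ops.gather(rid, "b"))
cached_before_close = ops.is_cached(rid)
ops.close(rid)
cached_after_close = ops.is_cached(rid)
```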

The implementation should be fairly straightforward, as keys/gather return
their results over the same streaming protocol used for bytecode (which is
in turn the same as returning results from Standard/SessionOpProcessor).
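
On the client side, the aggregateTo meta-data is what lets a driver turn the
streamed batches of a gather back into the original type. A minimal sketch in
Python, assuming each message is a dict with "data" and "meta" entries, that
map entries arrive as (key, value) pairs, and that bulkset items arrive
repeated once per occurrence; the thread does not pin down those wire details,
so treat them as guesses.

```python
from collections import Counter


def reassemble(messages):
    """Rebuild one side-effect from its streamed batches (hypothetical helper)."""
    messages = list(messages)
    if not messages:
        return None
    # All batches for one key are assumed to carry the same aggregateTo value.
    aggregate_to = messages[0]["meta"]["aggregateTo"]
    items = [item for message in messages for item in message["data"]]
    if aggregate_to == "list":
        return items
    if aggregate_to == "map":
        return dict(items)        # assumes (key, value) pairs
    if aggregate_to == "bulkset":
        return Counter(items)     # assumes one item per occurrence
    if aggregate_to == "none":
        # The server wrapped a single value in an iterator; unwrap it.
        return items[0] if items else None
    raise ValueError("unknown aggregateTo: " + str(aggregate_to))


list_stream = [
    {"data": ["e[7]"], "meta": {"sideEffect": "b", "aggregateTo": "list"}},
    {"data": ["e[8]"], "meta": {"sideEffect": "b", "aggregateTo": "list"}},
]
map_stream = [
    {"data": [("marko", 29)], "meta": {"sideEffect": "m", "aggregateTo": "map"}},
    {"data": [("josh", 32)], "meta": {"sideEffect": "m", "aggregateTo": "map"}},
]
bulk_stream = [
    {"data": ["x", "x", "y"], "meta": {"sideEffect": "s", "aggregateTo": "bulkset"}},
]
none_stream = [
    {"data": ["marko"], "meta": {"sideEffect": "n", "aggregateTo": "none"}},
]

as_list = reassemble(list_stream)
as_map = reassemble(map_stream)
as_bulk = reassemble(bulk_stream)
as_none = reassemble(none_stream)
```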







On Thu, Jul 28, 2016 at 7:12 PM, Stephen Mallette <[email protected]>
wrote:

> I have a rough cut of "returning side-effects" working on TINKERPOP-1278
> branch. I didn't bother making this change for REST at this time as I felt
> like it was more important and useful to have it run for websockets/NIO as
> the drivers that would ultimately power a RemoteConnection are generally
> written for that interface.
>
> gremlin> graph = RemoteGraph.open('conf/remote-graph.properties')
> ==>remotegraph[DriverServerConnection-localhost/127.0.0.1:8182 [graph=g]]
> gremlin> g = graph.traversal()
> ==>graphtraversalsource[remotegraph[DriverServerConnection-localhost/127.0.0.1:8182 [graph=g]], standard]
> gremlin> t = g.V(1).aggregate('a').outE("knows").aggregate("b").inV()
> ==>v[2]
> ==>v[4]
> gremlin> t.getSideEffects().get('a')
> ==>v[1]
> gremlin> t.getSideEffects().get('b')
> ==>e[7][1-knows->2]
> ==>e[8][1-knows->4]
>
> It was more effort than I expected to get this to work, mostly because of
> my attempts to do it all without a breaking change. It was also interesting
> (and nice) to see that the protocol didn't need to change structurally for
> this to work; however, drivers will need to adjust a bit to deal with the
> side-effects now streaming back following results. Note that this only
> matters for those drivers that support submitting Traversals as Bytecode
> (which I assume is "none" of them); existing script submissions should
> still have the same behavior and thus a terminating stream with the final
> result (side-effects left on the server as always).
>
> To allow for side-effects to come back I added two pieces of metadata to a
> ResponseMessage:
>
> 1. sideEffect - which holds the side-effect key. For instance, in the
> above example, it would have the values "a" and "b" at different points
> in the stream.
> 2. aggregateTo - which will be one of map, list, bulkset, or none. The
> significance here is that we needed a way to tell the client how a batch
> of results should be re-assembled. Recall that Gremlin Server iterates
> everything. If you return a String it puts the String into an Iterator for
> the response. There needed to be a way to say that a particular side-effect
> was converted to an iterator so that it could be re-assembled (or not) to
> what the original type was.
>
> As for the streaming model, Gremlin Server iterates the results first and
> then the side effects by key. Recall that a ResponseMessage batches up
> results returned from the server based on iteration size. I've arranged it
> so that a ResponseMessage will never mix results with side effects or one
> side-effect key with another key. In this way, it's easy to tie the
> sideEffect/aggregateTo values to the data within the message. That made it
> pretty easy for me to assemble the stream of side-effects into something
> useful on the client side.
>
> There is still a lot to do here:
>
> 1. Lots of code cleanup to say the least - some of the basic interfaces,
> classes, etc. that I added may see some change as I review with a fresh mind
> tomorrow.
> 2. I'd like to make it optional to return side-effects so that drivers or
> users can opt out of the expense of sending that information back if it
> isn't needed.
> 3. Piggy-backing on 2, as mentioned earlier in this thread, I think it
> would be nice if you could actively state as a user which side-effects you
> wanted sent back when you submit the traversal. Not sure where that would
> be specified right now given the way everything is hooked together.
> 4. Documentation is non-existent at this point beyond what I've tried to
> lay out in this thread, so I gotta get to that when all the change settles
> down. I assume that won't happen until Marko gets back from his time off as
> I suspect he'll think of a few extra things to do in making this all work
> well.
>
> Anyway, please let me know if there are any thoughts on this approach.
>
>
>
>
>
> On Fri, Jul 22, 2016 at 6:24 PM, Stephen Mallette <[email protected]>
> wrote:
>
>> Yes, I expected to return results first and then stream the side-effects.
>>
>> On Fri, Jul 22, 2016 at 5:05 PM, Dylan Millikin <[email protected]
>> > wrote:
>>
>>> > Perhaps nicer than doing all that trickery with transactions would be
>>> > to self-detach the vertex ahead of time
>>>
>>> This was the original idea, I never dove too deep into it as the
>>> sideEffects were applied mid traversal and extra filtering/SEs still had
>>> to
>>> occur. I wasn't sure it was actually possible and the transaction hack
>>> allowed me to move on.
>>>
>>> As for the GLV limitations, it's mostly going to be network overhead.
>>> Unfortunately one round trip with the server is costly and I know that
>>> we've ended up having to be creative in order to limit the round trips by
>>> concatenating scripts for each query. A GLV approach would need some
>>> careful planning and probably a multiline byteCode feature. But I digress,
>>> that's not what this thread is about.
>>>
>>> In the spirit of GLVs returning side effects how would your original
>>> proposition stream over the network? Would you get all data first and
>>> then
>>> SE? I'm guessing you would want to stream the SEs as well.
>>>
>>> On Fri, Jul 22, 2016 at 4:42 PM, Stephen Mallette <[email protected]>
>>> wrote:
>>>
>>> > > You can take the case of a group count as a really simple example.
>>> >
>>> > So you want the side-effect in the Vertex itself so you can use it
>>> with the
>>> > ORM. Interesting. Perhaps nicer than doing all that trickery with
>>> > transactions would be to self-detach the vertex ahead of time (i.e.
>>> create
>>> > a DetachedVertex) and add the property you want. As indirect as that
>>> > sounds, that seems more direct to me than the "fake" transaction. Not
>>> sure
>>> > that what I'm doing here will help you with that problem.
>>> >
>>> > > I'll add that I'm looking at this from a non-GLV perspective so I'm
>>> > > disregarding object mapping done through GraphSONv2.0 typing in favor
>>> > > of a format-guaranteed result set (say that either only contains
>>> > > vertices, edges, or a combination of both).
>>> >
>>> > Also interesting. Not sure that kind of serialization has a place in
>>> > TinkerPop where we encourage folks to return everything under the sun
>>> > by using Gremlin to return data in a form that suits their required end
>>> > result. If this is the outcome you want, I think that my suggestion
>>> > with self-detaching is probably on the right track. Maybe consider a
>>> > custom serializer that coerces all results to graph elements. That
>>> > would take care of all the embedded objects and the whole lot.
>>> >
>>> > > The reason for this is that GLV is too
>>> > inefficient for larger projects so a more traditional script->result
>>> > approach is required.
>>> >
>>> > I'm hijacking my own thread by going too deep down this path, but I
>>> > think we should strive toward a solution for GLVs to be robust enough
>>> > for developers to be successful with TinkerPop in the language of their
>>> > choice. Just like we'll never get rid of all lambdas in Gremlin, we will
>>> > probably never quite get rid of script->result for all use cases (but,
>>> > again, like lambdas the goal will be to get quite close). I find it quite
>>> > interesting that we might be able to figure out how a Python dev could
>>> > write Gremlin in Python that would remotely execute on the server
>>> > seamlessly; however, it's also interesting that that same GLV code could
>>> > be treated as server-side code to be accessed from a Python client. In
>>> > that way, heavy complex logic (the type you are talking about) could be
>>> > written in Python and then accessed from Python on the client. In short,
>>> > I think that it would be better to think of the work around GLVs as "how
>>> > to make Gremlin good in other languages" rather than the more narrow
>>> > view of just "remoting traversals". If we go wider, we might come up
>>> > with some good ideas to really broaden access to TinkerPop and graphs in
>>> > a very big way.
>>> >
>>> > We already have a really big improvement with "remoting" as compared to
>>> > good 'ol RexsterGraph - so that's something  - haha  ;)
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Fri, Jul 22, 2016 at 3:17 PM, Dylan Millikin <
>>> [email protected]>
>>> > wrote:
>>> >
>>> > > Yeah sorry I left out an important part. This is especially an issue
>>> when
>>> > > you're dealing with an ORM layer that's expecting results of a
>>> specific
>>> > > type (for example vertices).
>>> > > You can take the case of a group count as a really simple example.
>>> Your
>>> > > result set could be :
>>> > >
>>> > > [{count:5, vertex:v[1]}, {count:3, vertex:v[2]}, {count:1,
>>> vertex:v[3]}]
>>> > > and this is easy enough to do with gremlin. But unless this is built
>>> into
>>> > > the ORM itself chances are you'll need to implement the object
>>> mapping
>>> > > yourself.
>>> > >
>>> > > The alternative is to add "count" as a property of vertex and then
>>> you
>>> > can
>>> > > leverage all available features from your ORM such as filtering,
>>> > ordering,
>>> > > etc... Actually, the way we did it above we can also do those
>>> directly in
>>> > > gremlin as well.
>>> > >
>>> > > This is a simple case, but once it gets more complicated with
>>> > hierarchical
>>> > > data, the option of implementing the object mapping yourself is just
>>> a
>>> > > headache and often times less efficient than just rolling back a
>>> > > transaction.
>>> > >
>>> > > Dunno if that was clear enough this time around.
>>> > >
>>> > > I'll add that I'm looking at this from a non-GLV perspective so I'm
>>> > > disregarding object mapping done through GraphSONv2.0 typing in
>>> > > favor of a format-guaranteed result set (say that either only contains
>>> > > vertices, edges, or a combination of both). The reason for this is
>>> > > that GLV is too inefficient for larger projects so a more traditional
>>> > > script->result approach is required.
>>> > >
>>> > > On Fri, Jul 22, 2016 at 2:09 PM, Stephen Mallette <
>>> [email protected]>
>>> > > wrote:
>>> > >
>>> > > > hi dylan, could you please provide a more concrete example of the
>>> > problem
>>> > > > you're facing?
>>> > > >
>>> > > > On Fri, Jul 22, 2016 at 1:24 PM, Dylan Millikin <
>>> > > [email protected]>
>>> > > > wrote:
>>> > > >
>>> > > > > I'm going to confirm that this is actually a common issue.
>>> > > > > One thing to keep in mind is that often times the sideEffects are
>>> > > > directly
>>> > > > > linked to returned elements on a 1 --> n basis which neither of
>>> the
>>> > > above
>>> > > > > really help with. That is to say that if you're streaming your
>>> > results
>>> > > > > you'll need the sideEffects that relate to the streamed element.
>>> > > > >
>>> > > > > There is no easy way of handling this currently. Especially if
>>> you
>>> > > order
>>> > > > > your results and get unordered sideEffect results.
>>> > > > > One way we've found to work around this is very hacky, not
>>> efficient
>>> > > and
>>> > > > > only works for non mutating queries:
>>> > > > >
>>> > > > > - we start a transaction
>>> > > > > - we append the sideEffect data to the elements we're emitting
>>> (say
>>> > as
>>> > > > > properties of a vertex)
>>> > > > > - get the full result set with sideEffects as properties of the
>>> > result
>>> > > > > elements.
>>> > > > > - rollback transaction so properties are not persisted to the
>>> graph.
>>> > > > >
>>> > > > > A truly wicked succession of events born from absolute
>>> desperation.
>>> > > > > I enquired a while back about the ability to treat elements as
>>> > detached
>>> > > > > from the graph in order to do the above without the transaction
>>> > > handling.
>>> > > > > But I never followed up.
>>> > > > >
>>> > > > > I figured I would put this out there as another case where
>>> non-Java
>>> > > > > languages struggle.
>>> > > > >
>>> > > > > On Thu, Jul 21, 2016 at 1:19 PM, Stephen Mallette <
>>> > > [email protected]>
>>> > > > > wrote:
>>> > > > >
>>> > > > > > Your way made me think that if you wrote your traversal like
>>> that,
>>> > > you
>>> > > > > > would return the side-effects twice - once in your traversal as
>>> > part
>>> > > of
>>> > > > > the
>>> > > > > > standard result and then again as a side-effect.  Not sure what
>>> > that
>>> > > > > means
>>> > > > > > - just a thought.
>>> > > > > >
>>> > > > > > While I'm thinking thoughts that may or may not be obvious, it
>>> also
>>> > > > > occurs
>>> > > > > > to me that the downside for a GLV retrieving data that way is
>>> that
>>> > > the
>>> > > > > > result of the traversal won't be streamed back. It will
>>> aggregate
>>> > the
>>> > > > > > result (and the side-effects naturally) in memory and then
>>> return
>>> > > that
>>> > > > > all
>>> > > > > > as a whole.
>>> > > > > >
>>> > > > > > On Thu, Jul 21, 2016 at 11:24 AM, Daniel Kuppitz
>>> <[email protected]>
>>> > > > > wrote:
>>> > > > > >
>>> > > > > > > If you really want to have your result and your side-effects
>>> > > returned
>>> > > > > by
>>> > > > > > a
>>> > > > > > > single request, you could do something like this:
>>> > > > > > >
>>> > > > > > > gremlin> g.V(1,2,4).aggregate("names").by("name").aggregate("ages").by("age").fold().as("data").select("data", "names", "ages")
>>> > > > > > > ==>[data:[v[1], v[2], v[4]], names:[marko, vadas, josh], ages:[29, 27, 32]]
>>> > > > > > > gremlin> g.V(1,2,4).aggregate("names").by("name").aggregate("ages").by("age").fold().project("data", "se").by().by(cap("names","ages"))
>>> > > > > > > ==>[data:[v[1], v[2], v[4]], se:[names:[marko, vadas, josh], ages:[29, 27, 32]]]
>>> > > > > > > gremlin> g.V(1,2,4).aggregate("names").by("name").fold().project("data", "se").by().by(cap("names"))
>>> > > > > > > ==>[data:[v[1], v[2], v[4]], se:[marko, vadas, josh]]
>>> > > > > > >
>>> > > > > > > I'm not saying it would be bad to have Gremlin Server handle
>>> that
>>> > > for
>>> > > > > > you,
>>> > > > > > > just wanted to show that it's actually pretty easy to get the
>>> > data
>>> > > > and
>>> > > > > > the
>>> > > > > > > side-effects without using the traversal admin methods
>>> (hence it
>>> > > > should
>>> > > > > > > work for all GLVs).
>>> > > > > > >
>>> > > > > > > Cheers,
>>> > > > > > > Daniel
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > On Thu, Jul 21, 2016 at 4:51 PM, Stephen Mallette <
>>> > > > > [email protected]>
>>> > > > > > > wrote:
>>> > > > > > >
>>> > > > > > > > As we look to build out GLVs and expand Gremlin into other
>>> > > > > programming
>>> > > > > > > > languages, one of the important aspects of doing this
>>> should be
>>> > > to
>>> > > > > > > consider
>>> > > > > > > > consistency across GLVs. We should try to prevent
>>> capabilities
>>> > of
>>> > > > > Java
>>> > > > > > > from
>>> > > > > > > > being lost in Python, JS, etc.
>>> > > > > > > >
>>> > > > > > > > As we look at both RemoteGraph in Java and gremlin-python
>>> we
>>> > find
>>> > > > > that
>>> > > > > > > > there is no way to get traversal side-effects. If you
>>> write a
>>> > > > > Traversal
>>> > > > > > > and
>>> > > > > > > > want side-effects from it, you have to write your
>>> traversal to
>>> > > > return
>>> > > > > > > them
>>> > > > > > > > so that it comes back as part of the result set. Since
>>> > > RemoteGraph
>>> > > > > and
>>> > > > > > > > gremlin-python don't really allow you to directly "submit a
>>> > > script"
>>> > > > > > it's
>>> > > > > > > > not as though you can execute a traversal once for both the
>>> > > result
>>> > > > > and
>>> > > > > > > the
>>> > > > > > > > side-effect and package them together in a single request
>>> as
>>> > you
>>> > > > > might
>>> > > > > > do
>>> > > > > > > > with a simple script request:
>>> > > > > > > >
>>> > > > > > > > $ curl -X POST -d "{\"gremlin\":\"t=g.V(1).values('name').aggregate('x');[v:t.toList(),se:t.getSideEffects().get('x')]\"}" http://localhost:8182
>>> > > > > > > >
>>> > > > > > > > {"requestId":"3d3258b2-e421-459a-bf53-ea1e58ece4aa","status":{"message":"","code":200,"attributes":{}},"result":{"data":[{"v":["marko"]},{"se":["marko"]}],"meta":{}}}
>>> > > > > > > >
>>> > > > > > > > I'm thinking that we could alter things in a non-breaking
>>> > > > > > > > way to allow optional return of side-effect data so that
>>> > > > > > > > there is a way to have this all streamed back without the
>>> > > > > > > > need for the little workaround I just demonstrated. For
>>> > > > > > > > REST I think we could just include a sideEffect request
>>> > > > > > > > parameter that allowed for a list of side-effect keys to
>>> > > > > > > > return. Perhaps a "*" could indicate that all should be
>>> > > > > > > > returned. The side-effects could be serialized into a key
>>> > > > > > > > sibling to "data" called "sideEffect".
>>> > > > > > > >
>>> > > > > > > > I think a similar approach could be used for websockets
>>> and NIO
>>> > > > where
>>> > > > > > we
>>> > > > > > > > could amend the protocol to accept that sideEffect
>>> parameter.
>>> > We
>>> > > > > would
>>> > > > > > > > first stream results (marked with meta data to specify a
>>> > > "result")
>>> > > > > and
>>> > > > > > > then
>>> > > > > > > > stream side effects (again marked with meta data as such).
>>> > > > > > > >
>>> > > > > > > > I considered caching the Traversal instances so that a
>>> future
>>> > > > request
>>> > > > > > > could
>>> > > > > > > > get the side effects, but for a variety of reasons I
>>> abandoned
>>> > > that
>>> > > > > > (the
>>> > > > > > > > cache meant more heap and trying to get the right balance,
>>> new
>>> > > > > > > transactions
>>> > > > > > > > would have to be opened if the side-effect contained graph
>>> > > > elements,
>>> > > > > > > etc.)
>>> > > > > > > >
>>> > > > > > > > I like the approach of just maintaining our single
>>> > > > > > > > request-response model with the changes I proposed above.
>>> > > > > > > > It seems to provide the least impact with no new
>>> > > > > > > > dependencies, is backward compatible and could be
>>> > > > > > > > completely optional to RemoteConnections.
>>> > > > > > > >
