On Wed, Apr 20, 2011 at 11:25 AM, Craig Taverner <cr...@amanzi.com> wrote:

> I think sorting would need to be optional, since it is likely to be a
> performance and memory hog on large traversals. I think one of the key
> benefits of the traversal framework in the Embedded API is being able to
> traverse and 'stream' a very large graph without occupying much memory. If
> this can be achieved in the REST API (through pagination), that is a very
> good thing. I assume the main challenge is being able to freeze a traverser
> and keep it on hold between client requests for the next page. Perhaps you
> have already solved that bit?
>

While I agree with you that the ability to effectively stream the results of
a traversal is a very useful thing, I don't like the persisted traverser
approach, for several reasons. I'm sorry if my tone below is a bit harsh; I
don't mean it that way. I simply want to make a strong case for why I think
the hard way is the right way in this case.

First, the only good RESTful approach I can think of for doing persisted
traversals would be to "create" a traversal resource (since it is an object
that keeps persistent state) and get back an id to refer to it. Subsequent
calls for paged results would then go to that traversal resource, updating
its state and getting results back. Assuming this is the correct way to
implement it, it comes with a lot of questions. Should there be a timeout
for these resources, or is the user responsible for removing them from
memory? What happens when the server crashes and the client can't find the
traversal resources it has ids for?
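
To make those questions concrete, here is a rough sketch of what such a
persisted-traverser registry might have to look like server-side. Every name
here is hypothetical; this is not actual server code:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical registry of "frozen" traversals, keyed by the id that
    // would be handed back when the traversal resource is created
    // (POST /traversals -> 201 Created, Location: /traversals/{id}).
    public class TraversalRegistry<T> {
        private static class Entry<T> {
            final Iterator<T> traversal;
            volatile long lastAccessed = System.currentTimeMillis();
            Entry(Iterator<T> traversal) { this.traversal = traversal; }
        }

        private final Map<String, Entry<T>> active =
                new ConcurrentHashMap<String, Entry<T>>();
        private final long timeoutMillis;

        public TraversalRegistry(long timeoutMillis) {
            this.timeoutMillis = timeoutMillis;
        }

        public String create(Iterator<T> traversal) {
            String id = UUID.randomUUID().toString();
            active.put(id, new Entry<T>(traversal));
            return id;
        }

        // GET /traversals/{id}?pageSize=n -> next page, mutating state.
        public List<T> nextPage(String id, int pageSize) {
            Entry<T> entry = active.get(id);
            if (entry == null) {
                // Timed out, evicted, or the server restarted: the id the
                // client holds now points at nothing. This is exactly the
                // failure mode the questions above are about.
                throw new IllegalStateException("Unknown traversal: " + id);
            }
            entry.lastAccessed = System.currentTimeMillis();
            List<T> page = new ArrayList<T>(pageSize);
            for (int i = 0; i < pageSize && entry.traversal.hasNext(); i++) {
                page.add(entry.traversal.next());
            }
            if (!entry.traversal.hasNext()) active.remove(id); // exhausted
            return page;
        }

        // And somebody has to decide when to evict abandoned traversals.
        public void evictStale() {
            long now = System.currentTimeMillis();
            for (Map.Entry<String, Entry<T>> e : active.entrySet()) {
                if (now - e.getValue().lastAccessed > timeoutMillis) {
                    active.remove(e.getKey());
                }
            }
        }
    }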

If we somehow solve that, or find some better approach, we end up with an
API where a client can get paged results, but where two clients performing
the same traversal on the same data may get the same results back in a
different order (see my comments below on sorting based on expected
traversal behaviour). This means the API is really only useful if you
actually want the entire result back. And if that is the problem we want to
solve, a streaming solution is a much easier and faster approach than a
paging solution.
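
For comparison, streaming falls out almost for free in the embedded API,
because a traversal there is lazy; that is the property I'd want the REST
API to inherit. A simplified 1.x-style sketch (details trimmed):

    import java.io.IOException;
    import java.io.Writer;

    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.traversal.TraversalDescription;
    import org.neo4j.kernel.Traversal;

    // Because traverse(..) is lazy, each node can be written to the
    // response as it is visited; memory use stays flat regardless of how
    // large the graph is, and no per-client state survives the request.
    public class StreamingSketch {
        public static void streamIds(Node start, Writer out) throws IOException {
            TraversalDescription walk = Traversal.description().depthFirst();
            for (Node node : walk.traverse(start).nodes()) {
                out.write(Long.toString(node.getId()));
                out.write('\n'); // serialize and forget
            }
        }
    }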

Second, being able to iterate over the entire result set is only half of
the use cases we are looking to solve. The other half are the ones I gave
examples of (the blog case, presenting lists of things to users, and so
on), and those are not solved by this. Forcing users of our database to
pull all their data over the wire and sort the whole thing, only to keep
the first 10 items, for each user that lands on their front page, is not
OK.
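
To spell out the blog case, this is the client-side workaround we currently
force on people; BlogPost and fetchAllPostsOverRest() are hypothetical
stand-ins for application code:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    public class FrontPage {
        static class BlogPost {
            final long date; // epoch millis
            BlogPost(long date) { this.date = date; }
        }

        // Stand-in for pulling the full, unsorted result set over REST.
        static List<BlogPost> fetchAllPostsOverRest() {
            return new ArrayList<BlogPost>();
        }

        static List<BlogPost> tenNewest() {
            List<BlogPost> posts = fetchAllPostsOverRest(); // everything crosses the wire
            Collections.sort(posts, new Comparator<BlogPost>() { // full client-side sort
                public int compare(BlogPost a, BlogPost b) {
                    return Long.signum(b.date - a.date); // newest first
                }
            });
            return posts.subList(0, Math.min(10, posts.size())); // keep 10, discard the rest
        }
    }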

Third, and most importantly to me, using this case to put more pressure on
ourselves to implement real sorting is a really good thing. Sorting is
something that *really* should be provided by us; anyone who has used a
modern database expects this to be our problem to solve. We have a really
good starting point for optimizing sorting algorithms, sitting as we are
inside the kernel with our caches and indexes :)


>
> In my opinion, I would code the sorting as a characteristic of the graph
> itself, in order to avoid having to sort in the server (and incur the
> memory/performance hit). So that means I would use a domain-specific
> solution to sorting. Of course, generic sorting is nice also, but make it
> optional.
>

I agree sorting should be an opt-in feature. Putting meta-data like sort
order inside the graph is, I think, a matter of personal preference, and it
certainly has its place as a useful optimization. I do, however, think that
the "official" approach to sorting needs to be based on concepts familiar
from other databases: define your query, and define how you want the result
sorted. If indexes are available, the database can use them to optimize the
sorting; otherwise it will suck, but at least we're doing what the user
asked us to do. All lessons learned in YesSQL databases (see what I did
there?) should not be unlearned :)
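
To sketch that "define the query, define the sort" contract (my code, not
an existing API): the naive fallback below materializes and sorts, while an
index-aware implementation inside the kernel could produce the same
ordering without ever holding the full result set:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.List;

    public class SortedResults {
        // Naive fallback: pull in everything, then sort. This is the
        // "otherwise it will suck" path; with a suitable index the kernel
        // could skip the materialization step entirely.
        public static <T> List<T> sorted(Iterator<T> queryResult, Comparator<T> order) {
            List<T> all = new ArrayList<T>();
            while (queryResult.hasNext()) {
                all.add(queryResult.next());
            }
            Collections.sort(all, order);
            return all;
        }
    }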

Also, the approach of sorting via the traversal itself assumes knowledge of
the order in which the traverser will move through the graph, and that is
not necessarily something that will stay the same in later releases. Tobias
was talking about cache-first traversals as an addition to, or even a
replacement for, depth/breadth-first ones: a major optimization we cannot
do if we encourage people to sort "inside" the graph.
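
For clarity, this is the kind of in-graph sorting I mean; the NEXT
relationship type here is made up for the example:

    import org.neo4j.graphdb.Direction;
    import org.neo4j.graphdb.DynamicRelationshipType;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Relationship;
    import org.neo4j.graphdb.RelationshipType;

    // In-graph ordering: entries chained head -> e1 -> e2 -> ... with NEXT
    // relationships. Walking the chain yields them "sorted" only because
    // we control every hop ourselves; a cache-first traverser that visits
    // whatever is hottest in memory gives no such guarantee.
    public class InGraphOrder {
        static final RelationshipType NEXT = DynamicRelationshipType.withName("NEXT");

        public static void visitInOrder(Node head, StringBuilder out) {
            Node current = head;
            Relationship next;
            while ((next = current.getSingleRelationship(NEXT, Direction.OUTGOING)) != null) {
                current = next.getEndNode();
                out.append(current.getId()).append('\n');
            }
        }
    }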

/Jake


>
> On Wed, Apr 20, 2011 at 11:19 AM, Jacob Hansson
> <ja...@voltvoodoo.com> wrote:
>
> > On Tue, Apr 19, 2011 at 10:17 PM, Michael DeHaan
> > <michael.deh...@gmail.com> wrote:
> >
> > > On Tue, Apr 19, 2011 at 10:58 AM, Jim Webber <j...@neotechnology.com>
> > > wrote:
> > > >>> I'd like to propose that we put this functionality into the plugin
> > > (https://github.com/skanjila/gremlin-translation-plugin) that Peter and
> > > I are currently working on, thoughts?
> > > >
> > > > I'm thinking that, if we do it, it should be handled through content
> > > > negotiation. That is, if you ask for application/atom then you get
> > > > paged lists of results. I don't necessarily think that's a plugin;
> > > > it's more likely part of the representation logic in the server
> > > > itself.
> > >
> > > This is something I've been wondering about as I may have the need to
> > > feed very large graphs into the system and am wondering how the REST
> > > API will hold up compared to the native interface.
> > >
> > > What happens if the result of an index query (or traversal, whatever)
> > > legitimately needs to return 100k results?
> > >
> > > Wouldn't that be a bit large for one request?   If anything, it's a
> > > lot of JSON to decode at once.
> > >
> > >
> > Yeah, we can't do this right now, and implementing it is harder than it
> > seems at first glance, since we first need to implement sorting of
> > results, otherwise the paged result will be useless. Like Jim said
> > though, this is another one of those *must be done* features.
> >
> >
> > > Feeds make sense for things that are feed-like, but do atom feeds
> > > really make sense for results of very dynamic queries that don't get
> > > subscribed to?
> > > Or, related question, is there a point where the result sets of
> > > operations get so large that things start to break down?   What do
> > > people find this to generally be?
> > >
> >
> > I'm sure there are some awesome content types out there that we can
> > look at that will fit our uses. I don't feel confident saying if Atom
> > is a good choice; I've never worked with it..
> >
> > The point where this breaks down I'm gonna guess is in server-side
> > serialization, because we currently don't stream the serialized data,
> > but build it up in memory and ship it off when it's done. I'd say
> > you'll run out of memory after 10000 nodes or so on a small server,
> > which I think underlines how important this is to fix.
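
(To make the above concrete, here is a minimal sketch of the streaming
alternative, using Jackson's streaming generator as an illustration; the
server's actual serialization code may look nothing like this:)

    import java.io.IOException;
    import java.io.Writer;

    import org.codehaus.jackson.JsonFactory;
    import org.codehaus.jackson.JsonGenerator;
    import org.neo4j.graphdb.Node;

    // Write each node to the response stream as it is produced, instead
    // of accumulating the whole representation in memory and shipping it
    // at the end.
    public class StreamingSerializer {
        public static void writeNodes(Iterable<Node> nodes, Writer out)
                throws IOException {
            JsonGenerator json = new JsonFactory().createJsonGenerator(out);
            json.writeStartArray();
            for (Node node : nodes) {
                json.writeStartObject();
                json.writeNumberField("id", node.getId());
                json.writeEndObject();
                json.flush(); // hand bytes to the client as we go
            }
            json.writeEndArray();
            json.close();
        }
    }
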
> >
> >
> > >
> > > Maybe it's not an issue, but pointers to any problems REST API usage
> > > has with large data sets (and solutions?) would be welcome.
> > >
> >
> > Not aware of anyone bumping into these limits yet, but I'm sure we'll
> > start hearing about it.. The only current solution I can think of is a
> > server plugin that emulates this, but it would have to sort the result,
> > and I'm afraid that it will be hard (probably not impossible, but hard)
> > to implement that in a memory-efficient way that far away from the
> > kernel. You may just end up moving the OutOfMemoryExceptions to the
> > plugin instead of the serialization system.
> >
> >
> > >
> > > --Michael
> > > _______________________________________________
> > > Neo4j mailing list
> > > User@lists.neo4j.org
> > > https://lists.neo4j.org/mailman/listinfo/user
> > >
> >
> >
> >
> > --
> > Jacob Hansson
> > Phone: +46 (0) 763503395
> > Twitter: @jakewins
> > _______________________________________________
> > Neo4j mailing list
> > User@lists.neo4j.org
> > https://lists.neo4j.org/mailman/listinfo/user
> >
> _______________________________________________
> Neo4j mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
>



-- 
Jacob Hansson
Phone: +46 (0) 763503395
Twitter: @jakewins
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
