On Wed, Apr 20, 2011 at 7:42 PM, Craig Taverner <cr...@amanzi.com> wrote:

> To respond to your arguments, it would be worth noting a comment by
> Michael DeHaan later on in this thread. He asked for 'something more or
> less resembling a database cursor (see MongoDB's API)'. The trick is to
> achieve this without having to store a lot of state on the server, so it
> is robust against server restarts or crashes.
>
> If we compare to the SQL situation, there are two numbers passed by the
> client, the page size and the offset. The state can be re-created by the
> database server entirely from this information. How this is implemented
> in a relational database I do not know, but whether the database is
> relational or a graph, certain behaviors would be expected, like
> robustness against database content changes between the requests, and
> coping with very long gaps between requests. In my opinion the database
> cursor could be achieved by both of the following approaches:
>
>   - Starting the traversal from the beginning, and only returning results
>   after passing the cursor offset position
>

I assume this (using org.neo4j.kernel.Traversal and
org.neo4j.graphdb.traversal.Traverser from the embedded API):

    Traverser x = Traversal.description().traverse( someNode );
    x.nodes();
    x.nodes(); // Not necessarily in the same order as the previous call.

If that assumption is false, or there is some workaround, then I agree that
this is a valid approach, and a good, efficient alternative when sorting is
not relevant. Glancing at the code in TraverserImpl, though, it really looks
like the call to .nodes() will re-run the traversal, and I thought that
would mean the two calls can yield results in a different order?

>   - Keeping a live traverser in the server, and continuing it from the
>   previous position
>
> Personally I think the second approach is simply a performance optimization
> of the first. So robustness is achieved by having both, with the second one
> working when possible (no server restarts, timeout not expiring, etc.), and
> falling back to the first in other cases. This achieves performance and
> robustness. What we do not need to do with either case is keep an entire
> result set in memory between client requests.
>

I understand, and completely agree. My problem with the approach is that I
think it's harder than it looks at first glance.
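
To make the fallback concrete, here is a minimal sketch of the
re-traverse-and-skip approach, assuming the embedded traversal API
(org.neo4j.kernel.Traversal); the class name and the start/offset/pageSize
parameters are hypothetical:

    import java.util.ArrayList;
    import java.util.List;

    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.traversal.Traverser;
    import org.neo4j.kernel.Traversal;

    public class RetraversalPaging
    {
        // Re-runs the traversal from scratch on every request, so no
        // state has to survive on the server between client calls.
        public static List<Node> page( Node start, int offset, int pageSize )
        {
            Traverser traverser = Traversal.description().traverse( start );
            List<Node> result = new ArrayList<Node>();
            int position = 0;
            for ( Node node : traverser.nodes() )
            {
                if ( position++ < offset )
                {
                    continue; // skip everything before the cursor offset
                }
                result.add( node );
                if ( result.size() == pageSize )
                {
                    break; // one full page collected
                }
            }
            return result;
        }
    }

Of course, this only yields stable pages if the traversal visits nodes in
the same order every time, which is exactly the assumption I question
above.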


>
> Now when you add sorting into the picture, then you need to generate the
> complete result-set in memory, sort, paginate and return only the requested
> page. If the entire process has to be repeated for every page requested,
> this could perform very badly for large result sets. I must believe that
> relational databases do not do this (but I do not know how they paginate
> sorted results, unless the sort order is maintained in an index).
>

This is what makes me push for the sorted approach - relational databases
are doing this. I don't know how they do it, but they are, and we should be
at least as good.
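
For reference, the naive approach we both want to avoid would look roughly
like this sketch (the comparator, offset and pageSize are hypothetical);
note that the full materialization and sort happen again on every page
request:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.traversal.Traverser;

    public class NaiveSortedPaging
    {
        // Materialize, sort, slice - all repeated for every page request.
        public static List<Node> sortedPage( Traverser traverser,
                Comparator<Node> comparator, int offset, int pageSize )
        {
            List<Node> all = new ArrayList<Node>();
            for ( Node node : traverser.nodes() )
            {
                all.add( node ); // O(result set) memory, every request
            }
            Collections.sort( all, comparator ); // full sort, every request
            int from = Math.min( offset, all.size() );
            int to = Math.min( offset + pageSize, all.size() );
            return new ArrayList<Node>( all.subList( from, to ) );
        }
    }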


>
> To avoid keeping everything in memory, or repeatedly reloading everything
> to memory on every page request, we need sorted results to be produced on
> the stream. This can be done by keeping the sort order in an index. This
> is very hard to do in a generic way, which is why I thought it best done
> in a domain-specific way.
>

I agree the issue of what should be indexed to optimize sorting is a
domain-specific problem, but I think that is how relational databases treat
it as well. If you want sorting to be fast, you have to tell them to index
the field you will be sorting on. The only difference, compared to having
the user put the sorting index in the graph, is that relational databases
will handle the indexing for you, saving you a *ton* of work, and I think
we should too.

There are cases where you need to add this sort of metadata to your domain
model, because the sorting logic is too complex, and you see that in
relational dbs as well, where people create lookup tables for various
things. There are for sure valid uses for that too, but I believe the
generic approach covers the *vast* majority of the common use cases.
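
To illustrate what an index-backed sort can already look like from user
code, here is a sketch using the Lucene index integration; the "posts"
index, the query and the "published" sort key are all hypothetical, and
I'm assuming QueryContext.sort() behaves as documented:

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.index.Index;
    import org.neo4j.graphdb.index.IndexHits;
    import org.neo4j.index.lucene.QueryContext;

    public class SortedIndexQuery
    {
        // Ask Lucene for hits already sorted by an indexed property, so
        // we never materialize and sort the full result set ourselves.
        public static void printSorted( GraphDatabaseService graphDb )
        {
            Index<Node> posts = graphDb.index().forNodes( "posts" );
            IndexHits<Node> hits = posts.query(
                    new QueryContext( "title:*" ).sort( "published" ) );
            try
            {
                for ( Node post : hits )
                {
                    System.out.println( post.getProperty( "title" ) );
                }
            }
            finally
            {
                hits.close();
            }
        }
    }

The point being: if we maintained such indexes automatically, sorted,
streamable results would fall out of the existing index machinery.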


> Finally, I think we are really looking at two different but valid use
> cases: the need for generic sorting combined with pagination, and the
> need for pagination on very large result sets. The former use case can
> work with re-traversing and sorting on each client request, is fully
> generic, but will perform badly on large result sets. The latter can
> perform adequately on large result sets, as long as you do not need to
> sort (and use the database cursor approach to avoid loading the result
> set into memory).
>

I agree, this is important. I'd like to change "the need for pagination on
very large result sets" to "the ability to return very large result sets
over the wire". That opens up the debate to solutions like HTTP streaming,
which do not have the problems that come with keeping state on the server
between calls.
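
As a sketch of what I mean by streaming: something like JAX-RS's
StreamingOutput (the server runs on a JAX-RS stack) would let us serialize
each result as the traversal produces it, instead of building the whole
representation in memory first. The resource path and the traverse()
helper below are hypothetical:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.io.PrintWriter;

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.core.Response;
    import javax.ws.rs.core.StreamingOutput;

    import org.neo4j.graphdb.Node;

    @Path( "/stream-example" ) // hypothetical resource
    public class StreamingTraversalResource
    {
        @GET
        public Response streamNodes()
        {
            StreamingOutput stream = new StreamingOutput()
            {
                public void write( OutputStream output ) throws IOException
                {
                    PrintWriter writer = new PrintWriter( output );
                    // Serialize and flush each node as it is produced, so
                    // memory use stays constant regardless of result size.
                    for ( Node node : traverse() )
                    {
                        writer.println( node.getId() );
                        writer.flush();
                    }
                }
            };
            return Response.ok( stream ).build();
        }

        private Iterable<Node> traverse()
        {
            // Hypothetical: the actual traversal would run here.
            throw new UnsupportedOperationException( "sketch only" );
        }
    }

The trade-off is that streaming only gives forward iteration, not random
access to pages, but nothing is buffered server-side and the client can
start consuming results immediately.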


>
> On Wed, Apr 20, 2011 at 2:01 PM, Jacob Hansson <ja...@voltvoodoo.com> wrote:
>
> > On Wed, Apr 20, 2011 at 11:25 AM, Craig Taverner <cr...@amanzi.com> wrote:
> >
> > > I think sorting would need to be optional, since it is likely to be
> > > a performance and memory hog on large traversals. I think one of the
> > > key benefits of the traversal framework in the Embedded API is being
> > > able to traverse and 'stream' a very large graph without occupying
> > > much memory. If this can be achieved in the REST API (through
> > > pagination), that is a very good thing. I assume the main challenge
> > > is being able to freeze a traverser and keep it on hold between
> > > client requests for the next page. Perhaps you have already solved
> > > that bit?
> > >
> >
> > While I agree with you that the ability to effectively stream the
> > results of a traversal is a very useful thing, I don't like the
> > persisted traverser approach, for several reasons. I'm sorry if my
> > tone below is a bit harsh, I don't mean it that way; I simply want to
> > make a strong case for why I think the hard way is the right way in
> > this case.
> >
> > First, the only good RESTful approach I can think of for doing
> > persisted traversals would be to "create" a traversal resource (since
> > it is an object that keeps persistent state), and get back an id to
> > refer to it. Subsequent calls to paged results would then be to that
> > traversal resource, updating its state and getting results back.
> > Assuming this is the correct way to implement this, it comes with a
> > lot of questions. Should there be a timeout for these resources, or is
> > the user responsible for removing them from memory? What happens when
> > the server crashes and the client can't find the traversal resources
> > it has ids for?
> >
> > If we somehow solve that or find some better approach, we end up with
> > an API where a client can get paged results, but two clients
> > performing the same traversal on the same data may get back the same
> > result in different order (see my comments on sorting based on
> > expected traversal behaviour below). This means that the API is really
> > only useful if you actually want to get the entire result back. If
> > that was the problem we wanted to solve, a streaming solution is a
> > much easier and faster approach than a paging solution.
> >
> > Second, being able to iterate over the entire result set is only half
> > of the use cases we are looking to solve. The other half are the ones
> > I mentioned examples of (the blog case, presenting lists of things to
> > users and so on), and those are not solved by this. Forcing users of
> > our database to pull out all their data over the wire and sort the
> > whole thing, only to keep the first 10 items, for each user that lands
> > on their frontpage, is not ok.
> >
> > Third, and most importantly to me, using this case to put more
> > pressure on ourselves to implement real sorting is a really good
> > thing. Sorting is something that *really* should be provided by us;
> > anyone who has used a modern database expects this to be our problem
> > to solve. We have a really good starting point for optimizing sorting
> > algorithms, sitting as we are inside the kernel with our caches and
> > indexes :)
> >
> >
> > >
> > > In my opinion, I would code the sorting as a characteristic of the
> > > graph itself, in order to avoid having to sort in the server (and
> > > incur the memory/performance hit). So that means I would use a
> > > domain-specific solution to sorting. Of course, generic sorting is
> > > nice also, but make it optional.
> > >
> >
> > I agree sorting should be an opt-in feature. Putting meta-data like
> > sorting order and similar things inside the graph I think is a matter
> > of personal preference, and for sure has its place as a useful
> > optimization. I do, however, think that the "official" approach to
> > sorting needs to be based on concepts familiar from other databases:
> > define your query, and define how you want the result sorted. If
> > indexes are available the database can use them to optimize the
> > sorting, otherwise it will suck, but at least we're doing what the
> > user wants us to do. All lessons learned in YesSQL databases (see what
> > I did there?) should not be unlearned :)
> >
> > Also, the approach of sorting via the traversal itself assumes
> > knowledge of which order the traverser will move through the graph,
> > and that is not necessarily something that will be the same in later
> > releases. Tobias was talking about cache-first traversals as an
> > addition to, or even a replacement of, depth/breadth-first ones, a
> > major optimization we cannot do if we encourage people to sort
> > "inside" the graph.
> >
> > /Jake
> >
> >
> > >
> > > On Wed, Apr 20, 2011 at 11:19 AM, Jacob Hansson <ja...@voltvoodoo.com> wrote:
> > >
> > > > On Tue, Apr 19, 2011 at 10:17 PM, Michael DeHaan <michael.deh...@gmail.com> wrote:
> > > >
> > > > > On Tue, Apr 19, 2011 at 10:58 AM, Jim Webber <j...@neotechnology.com> wrote:
> > > > > >>> I'd like to propose that we put this functionality into the
> > > > > >>> plugin (https://github.com/skanjila/gremlin-translation-plugin)
> > > > > >>> that Peter and I are currently working on, thoughts?
> > > > > >
> > > > > > I'm thinking that, if we do it, it should be handled through
> > > > > > content negotiation. That is, if you ask for application/atom
> > > > > > then you get paged lists of results. I don't necessarily think
> > > > > > that's a plugin, it's more likely part of the representation
> > > > > > logic in the server itself.
> > > > >
> > > > > This is something I've been wondering about as I may have the
> > > > > need to feed very large graphs into the system, and am wondering
> > > > > how the REST API will hold up compared to the native interface.
> > > > >
> > > > > What happens if the result of an index query (or traversal,
> > > > > whatever) legitimately needs to return 100k results?
> > > > >
> > > > > Wouldn't that be a bit large for one request? If anything, it's a
> > > > > lot of JSON to decode at once.
> > > > >
> > > > >
> > > > Yeah, we can't do this right now, and implementing it is harder
> > > > than it seems at first glance, since we first need to implement
> > > > sorting of results, otherwise the paged result will be useless.
> > > > Like Jim said though, this is another one of those *must be done*
> > > > features.
> > > >
> > > >
> > > > > Feeds make sense for things that are feed-like, but do atom
> > > > > feeds really make sense for results of very dynamic queries that
> > > > > don't get subscribed to?
> > > > > Or, related question, is there a point where the result sets of
> > > > > operations get so large that things start to break down? What do
> > > > > people find this to generally be?
> > > > >
> > > >
> > > > I'm sure there are some awesome content types out there that we
> > > > can look at that will fit our uses. I don't feel confident to say
> > > > if Atom is a good choice, I've never worked with it.
> > > >
> > > > The point where this breaks down, I'm gonna guess, is in
> > > > server-side serialization, because we currently don't stream the
> > > > serialized data, but build it up in memory and ship it off when
> > > > it's done. I'd say you'll run out of memory after 10000 nodes or so
> > > > on a small server, which I think underlines how important this is
> > > > to fix.
> > > >
> > > >
> > > > >
> > > > > Maybe it's not an issue, but pointers to any problems REST API
> > > > > usage has with large data sets (and solutions?) would be welcome.
> > > > >
> > > >
> > > > Not aware of anyone bumping into these limits yet, but I'm sure
> > > > we'll start hearing about it. The only current solution I can
> > > > think of is a server plugin that emulates this, but it would have
> > > > to sort the result, and I'm afraid that it will be hard (probably
> > > > not impossible, but hard) to implement that in a memory-efficient
> > > > way that far away from the kernel. You may just end up moving the
> > > > OutOfMemoryExceptions to the plugin instead of the serialization
> > > > system.
> > > >
> > > >
> > > > >
> > > > > --Michael



-- 
Jacob Hansson
Phone: +46 (0) 763503395
Twitter: @jakewins
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
