>
> I assume this:
>    Traverser x = Traversal.description().traverse( someNode );
>    x.nodes();
>    x.nodes(); // Not necessarily in the same order as previous call.
>
> If that assumption is false or there is some workaround, then I agree that
> this is a valid approach, and a good efficient alternative when sorting is
> not relevant. Glancing at the code in TraverserImpl though, it really looks
> like the call to .nodes will re-run the traversal, and I thought that would
> mean the two calls can yield results in a different order?
>

OK. My assumptions were different. I assume that while the order is not
easily predictable, it is reproducible as long as the underlying graph has
not changed. If the graph changes, then the order can change also. But I
think that is true of a relational database too, is it not?

So, obviously pagination is expected (by me at least) to give page X as it
is at the time of the request for page X, not at the time of the request for
page 1.

But my assumptions could be incorrect too...
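To make my assumption concrete, here is a rough sketch (untested, and the
page() helper is my own invention, not part of any API) of how a page
request could be served by simply re-running the traversal and skipping
ahead. It is only correct if two runs over an unchanged graph visit nodes in
the same order, and it naturally gives page X as the graph looks at the time
of that request:

   import java.util.ArrayList;
   import java.util.List;
   import org.neo4j.graphdb.Node;
   import org.neo4j.graphdb.traversal.Traverser;
   import org.neo4j.kernel.Traversal;

   // Sketch only: serves one page by re-running the traversal and
   // skipping the nodes that belong to earlier pages.
   List<Node> page( Node start, int pageNumber, int pageSize )
   {
       Traverser traverser = Traversal.description().traverse( start );
       List<Node> result = new ArrayList<Node>();
       long skip = (long) pageNumber * pageSize;
       for ( Node node : traverser.nodes() )
       {
           if ( skip-- > 0 )
           {
               continue; // still inside an earlier page
           }
           result.add( node );
           if ( result.size() == pageSize )
           {
               break; // stop traversing once the page is full
           }
       }
       return result;
   }

Note that this never holds more than one page of nodes in memory, which is
the performance property I am after.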

> I understand, and completely agree. My problem with the approach is that I
> think it's harder than it looks at first glance.
>

I guess I cannot argue that point. My original email said I did not know if
this idea had been solved yet. Since some of the key people involved have
not chimed in on this discussion, either we are reasonably correct in our
ideas, or so wrong that they don't know where to begin correcting us ;-)

> This is what makes me push for the sorted approach - relational databases
> are doing this. I don't know how they do it, but they are, and we should be
> at least as good.
>

Absolutely. We should be as good. Relational databases manage to serve a
page from deep down the list quite fast. I have to believe that if they had
to complete the traversal, sort the results, and extract the page on every
single page request, they could not be so fast. My ideas for the traversal
are 'supposed' to be performance enhancements, which is why I like them ;-)

> I agree the issue of what should be indexed to optimize sorting is a
> domain-specific problem, but I think that is how relational databases treat
> it as well. If you want sorting to be fast, you have to tell them to index
> the field you will be sorting on. The only difference versus having the
> user put the sorting index in the graph is that relational databases will
> handle the indexing for you, saving you a *ton* of work, and I think we
> should too.
>

Yes. I was discussing automatic indexing with Mattias recently. I think (and
hope I am right) that once we move to automatic indexes, it will be possible
to put external indexes (à la Lucene) and graph indexes (like the ones I
favour) behind the same API. In that case the database should more easily be
able to make the right optimization decisions, using an index to provide
sorted results quickly and with a low memory footprint where possible, based
on the existence or non-existence of the necessary indices. Then all the
developer needs to do to make things really fast is put in the right index.
For some data that would be Lucene, and for others it would be a graph
index. If we get to this point, I think we will have closed a key usability
gap with relational databases.
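As a purely hypothetical illustration of what I mean (the sort() step below
does not exist in any current API; it is just how I imagine it could look):

   // Hypothetical: a sort() step that uses a Lucene or graph index on
   // "name" when one exists, and falls back to a full traversal plus
   // in-memory sort when it does not.
   Traverser sorted = Traversal.description()
           .sort( "name" )   // imaginary method, not in today's API
           .traverse( someNode );

   for ( Node node : sorted.nodes() )
   {
       // with an index, results arrive already ordered, so a page can
       // be streamed without materialising the whole result set
   }

The point being that the same call would work either way; only the speed and
memory profile change depending on which indices exist.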

> There are cases where you need to add this sort of meta-data to your domain
> model, where the sorting logic is too complex, and you see that in
> relational dbs as well, where people create lookup tables for various
> things. There are for sure valid uses for that too, but the generic
> approach I believe covers the *vast* majority of the common use cases.
>

Perhaps. But I'm not sure the two cases are as lop-sided as you think. I
believe users with very large data sets are very interested in Neo4j.

> I agree, this is important. I'd like to change "the need for pagination on
> very large result sets" to "the ability to return very large result sets
> over the wire". That opens up the debate to solutions like http streaming,
> which do not have the problems that come with keeping state on the server
> between calls.
>

I think there are two separate, but related, problems to solve. One is
transferring large result sets over the wire for those who need that. The
other is efficiently serving a small page of results from a large data set.
Most of our discussion so far has focused on the latter.

For the former, I did a bit of experimenting last year and was able to
shrink my JSON by a factor of several by moving all meta-data into a header
section. This works very well for data with a repeating structure, for
example a large number of records with a similar schema. I know schema is a
nasty word in the nosql world, but it is certainly common for data to have a
repeating pattern, especially when dealing with very large numbers of
records. Then you find that something like CSV is actually an efficient
format, since the bulk of the text is the data itself. We did this in JSON
by simply specifying a meta-data element (with the headers) and then a
contents section with a long array of values. It worked very well indeed:
even though we had a half-dozen different 'schemas' in the document, it was
still much more efficient than specifying the meaning of every field, as is
usually done in JSON or XML.
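
As a made-up illustration of the shape (field and section names invented,
not our actual format), instead of the usual:

   [ { "id": 1, "name": "Alice", "age": 34 },
     { "id": 2, "name": "Bob",   "age": 28 } ]

we sent something like:

   { "meta":     { "headers": [ "id", "name", "age" ] },
     "contents": [ [ 1, "Alice", 34 ],
                   [ 2, "Bob",   28 ] ] }

so the field names appear once per schema rather than once per record, which
is what lets the payload approach CSV density for large record counts while
remaining valid JSON.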