On Thu, Apr 21, 2011 at 2:52 PM, Craig Taverner <cr...@amanzi.com> wrote:

> >
> > I assume this:
> >    Traverser x = Traversal.description().traverse( someNode );
> >    x.nodes();
> >    x.nodes(); // Not necessarily in the same order as previous call.
> >
> > If that assumption is false or there is some workaround, then I agree
> > that this is a valid approach, and a good efficient alternative when
> > sorting is not relevant. Glancing at the code in TraverserImpl, though,
> > it really looks like the call to .nodes will re-run the traversal, and
> > I thought that would mean the two calls can yield results in different
> > order?
> >
>
> OK. My assumptions were different. I assume that while the order is not
> easily predictable, it is reproducible as long as the underlying graph
> has not changed. If the graph changes, then the order can change also.
> But I think this is true of a relational database as well, is it not?
>
> So, obviously, pagination is expected (by me at least) to give page X as
> it is at the time of the request for page X, not at the time of the
> request for page 1.
>
> But my assumptions could be incorrect too...
>

I think you are probably right about that; as you say, if you don't provide
a sort order, a SQL database will exhibit the same kind of undefined
behaviour.

Leaving out single-user, single-threaded applications, then, we must assume
that the database will be accessed by other parties while we page through
our result. If the cache-first traversal optimization gets implemented,
merely reading the results of the traversal might be enough to make the
sort order differ the next time around. The point being: it is reasonable
to assume that, due to the shifting sort order, parts of the traversal
result will never be returned at all.

I can only think of a few use cases where losing some of the expected
result is ok, for instance if you just want to "peek" at the result.
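To make that failure mode concrete, here is a minimal sketch in plain Java
(not the Neo4j API; the names are made up for illustration). Each page
request slices the *current* ordering, as a stateless paging call would, so
an item that re-sorts upward between requests is skipped entirely while
another is returned twice:

```java
import java.util.*;

// Sketch: paging over a result whose sort order shifts between page
// requests can silently skip (and duplicate) items.
public class ShiftingPagesDemo {

    // Extract one page from the current ordering, as each stateless
    // paging request would.
    static List<String> page(List<String> ordered, int pageNo, int size) {
        int from = Math.min(pageNo * size, ordered.size());
        int to = Math.min(from + size, ordered.size());
        return ordered.subList(from, to);
    }

    public static void main(String[] args) {
        List<String> before = List.of("A", "B", "C", "D", "E", "F");
        // "D" gains relevance between requests and re-sorts upwards.
        List<String> after  = List.of("A", "D", "B", "C", "E", "F");

        Set<String> seen = new LinkedHashSet<>();
        seen.addAll(page(before, 0, 3)); // page 1 against the old order
        seen.addAll(page(after, 1, 3));  // page 2 against the new order

        System.out.println(seen);        // [A, B, C, E, F] -- "D" is lost
    }
}
```

Here "D" moves into page-1 territory after page 1 has already been read, so
it is never returned, while "C" comes back on both pages.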


>
> > I understand, and completely agree. My problem with the approach is
> > that I think it's harder than it looks at first glance.
>
> I guess I cannot argue that point. My original email said I did not know
> if this idea had been solved yet. Since some of the key people involved
> in this have not chipped into this discussion, either we are reasonably
> correct in our ideas, or so wrong that they don't know where to begin
> correcting us ;-)
>

I'm waiting for one of those SlapOnTheFingersExceptions that Tobias has
been handing out :)


>
> > This is what makes me push for the sorted approach - relational
> > databases are doing this. I don't know how they do it, but they are,
> > and we should be at least as good.
> >
>
> Absolutely. We should be as good. Relational databases manage to serve a
> page from deep down the list quite fast. I have to believe that if they
> had to complete the traversal, sort the results and extract the page on
> every single page request, they could not be so fast. I think my ideas
> for the traversal are 'supposed' to be performance enhancements, and that
> is why I like them ;-)
>

I think they are performance enhancements, huge ones. But I still think
there are hard problems involved in putting them into practice.


>
> > I agree the issue of what should be indexed to optimize sorting is a
> > domain-specific problem, but I think that is how relational databases
> > treat it as well. If you want sorting to be fast, you have to tell them
> > to index the field you will be sorting on. The only difference versus
> > having the user put the sorting index in the graph is that relational
> > databases will handle the indexing for you, saving you a *ton* of work,
> > and I think we should too.
> >
>
> Yes. I was discussing automatic indexing with Mattias recently. I think
> (and hope I am right) that once we move to automatic indexes, it will be
> possible to put external indexes (à la Lucene) and graph indexes (like
> the ones I favour) behind the same API. In this case perhaps the database
> will more easily be able to make the right optimized decisions, and use
> the index for providing sorted results fast and with a low memory
> footprint where possible, based on the existence or non-existence of the
> necessary indices. Then all the developer needs to do to make things
> really fast is put in the right index. For some data that would be
> Lucene, and for others it would be a graph index. If we get to this
> point, I think we will have closed a key usability gap with relational
> databases.
>

Couldn't agree more :)
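To make the index-backed idea concrete, here is a rough sketch in plain
Java, with a TreeMap standing in for whatever index implementation would
end up behind that API (all names here are invented). Once the sort key is
indexed up front, serving a page deep in the ordering is just a matter of
skipping to the right offset in an already-sorted structure, rather than
materializing and sorting the whole result on every request:

```java
import java.util.*;

// Sketch: an index kept sorted on the sort key (TreeMap here, standing
// in for a Lucene or graph index) lets a page deep in the ordering be
// served without re-sorting the full result set per request.
public class SortedIndexPaging {

    // sort key -> node id; TreeMap keeps keys ordered at insertion time
    private final NavigableMap<String, Long> index = new TreeMap<>();

    void add(String sortKey, long nodeId) {
        index.put(sortKey, nodeId);
    }

    // Walk the already-sorted index and take just the requested slice.
    List<Long> page(int pageNo, int size) {
        List<Long> out = new ArrayList<>(size);
        int skip = pageNo * size;
        for (long id : index.values()) {
            if (skip-- > 0) continue;   // fast-forward to the page start
            out.add(id);
            if (out.size() == size) break;
        }
        return out;
    }

    public static void main(String[] args) {
        SortedIndexPaging idx = new SortedIndexPaging();
        idx.add("carol", 3L);
        idx.add("alice", 1L);
        idx.add("bob", 2L);
        System.out.println(idx.page(1, 2)); // [3]
    }
}
```

The memory footprint per request stays at one page, and the ordering is as
stable as the index itself, which is exactly the property the paging
discussion above is after.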


>
> > There are cases where you need to add this sort of metadata to your
> > domain model, where the sorting logic is too complex, and you see that
> > in relational DBs as well, where people create lookup tables for
> > various things. There are for sure valid uses for that too, but the
> > generic approach, I believe, covers the *vast* majority of the common
> > use cases.
>
> Perhaps. But I'm not sure the two extremes are as lop-sided as you think. I
> think large data users are very interested in Neo4j.
>

Yeah, you are right about that, and I realize the sorted approach would not
work if the result gets large enough. If we can find a good way to
implement the stateful traversal result as an optimization to paged
results, then I am all for it.


>
> > I agree, this is important. I'd like to change "the need for pagination
> > on very large result sets" to "the ability to return very large result
> > sets over the wire". That opens up the debate to solutions like HTTP
> > streaming, which do not have the problems that come with keeping state
> > on the server between calls.
> >
>
> I think there are two separate, but related, problems to solve. One is
> the transfer of large result sets over the wire for people that need
> that. The other is efficiently providing a small page of results from a
> large dataset. Most of our discussion has so far focused on the latter.
>
> For the former, I did a bit of experimenting last year and was able to
> compact my JSON several times over by moving all meta-data into a header
> section. This works very well for data that has a repeating structure,
> for example a large number of records with a similar schema. I know
> schema is a nasty word in the NoSQL world, but it is certainly common for
> data to have a repeating pattern, especially when dealing with very large
> numbers of records. Then you find that something like CSV is actually an
> efficient format, since the bulk of the text is only the data. We did
> this in JSON by simply specifying a meta-data element (with the headers)
> and then a contents section with a long array of values. It worked very
> well indeed; even though we had a half-dozen different 'schemas' in the
> document, it was still much more efficient than specifying the meaning of
> every field as is usually done in JSON or XML.
>

This sounds really cool, would be a great thing to look into!
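For anyone following along, a sketch of the kind of layout described above
(the field names here are invented for illustration): the 'schema' is
stated once in a header, and the contents section carries only values, much
like CSV rows, instead of repeating every key on every record:

```json
{
  "meta":     { "columns": ["id", "name", "age"] },
  "contents": [
    [1, "Alice", 34],
    [2, "Bob",   28],
    [3, "Carol", 41]
  ]
}
```

The verbose equivalent, `[{"id": 1, "name": "Alice", "age": 34}, ...]`,
repeats the three key strings for every single record, which is where the
several-fold size difference comes from on large result sets.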


> _______________________________________________
> Neo4j mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
>



-- 
Jacob Hansson
Phone: +46 (0) 763503395
Twitter: @jakewins
