On Thu, Apr 21, 2011 at 11:18 PM, Michael Hunger < michael.hun...@neotechnology.com> wrote:
> Rick, > > great thoughts. > > Good catch, forgot to add the in-graph representation of the results to my > mail, thanks for adding that part. Temporary (transient) nodes and > relationships would really rock here, with the advantage that with HA you > have them distributed to all cluster nodes. > Certainly Craig has to add some interesting things to this, as those > resemble probably his in graph indexes / R-Trees. > > As traversers are lazy a count operation is not so easily possible, you > could run the traversal and discard the results. But then the client could > also just pull those results until it reaches its > internal tresholds and then decide to use more filtering or stop the > pulling and ask the user for more filtering (you can always retrieve n+1 and > show the user that there are more that "n" results available). > > The index result size() method only returns an estimate of the result size > (which might not contain currently changed index entries). > > Please don't forget that a count() query in a RDBMS can be as ridicully > expensive as the original query (especially if just the column selection was > replaced with count, and sorting, grouping etc was still left in place > together with lots of joins). > > Sorting on your own instead of letting the db do that mostly harms the > performance as it requires you to build up all the data in memory, sort it > and then use it. Instead of having the db do that more efficiently, stream > the data and you can use it directly from the stream. > throw new SlapOnTheFingersException("sometimes the application developer can do a better job since she has better knowledge of the data, the database only has generic knowledge"); Since Jake had already mentioned (in this very thread) that he expected one of those, I thought I might as well throw one in there. I agree with the analysis of count(), as the name ("count") implies, it will have to run the entire query in order to count the number of resulting items. About sorting I'm torn. The perception of sorting in the database being slow that Rick points to is one that I've seen a lot. When you hand the responsibility of sorting to the database you hide the fact that sorting is an expensive operation, it does require reading in all data in order to sort it. People often expect databases to be "smarter" than that, since they sometimes are, but that is pretty much only when reading straight from an index and not doing much more. A generic sort of data can never be better than O(log(n!)) [O(log(n!)) is almost equal to, and commonly rounded to the easier to compute function O(n log(n))]. If you put the responsibility of sorting in the hands of the application you can sometimes utilize knowledge about the data to do a more efficient sorting than the database could have done. Most often by simply doing an application level filtering of the data before sorting it, based on some filtering that could not be transfered to the database query. This does however make the work of the application developer slightly more tedious, which is why I think it would be sensible to have support for sorting on the database level, and hope that users will be sensible about using it, and not assume magic from it. Something I find very interesting is the concept of semi-sorted data. Semi-sorted data is often good enough, easier to achieve, and quite easy to then sort completely if that is required. Examples of semi-sorted data could be data in an order that satisfies the heap property. Or for spatial queries returning the closest hits first, but not necessarily in perfect order, say returning the hits within a miles radius first, before the ones in a radius between 1-10 miles, and so on, without requiring the hits in each 'segment' to be perfectly ordered by distance. Breadth first order is another example of semi-sorted data, that could be used when traversing data as you've outlined with "paging nodes", or similarly "grouped by parent node"-order. I must say that I really enjoy following this discussion. I really like the idea of streaming, since I think that can be implemented more easily than paging, while satisfying many of the desired use cases. But I still want to hear more arguments for and against both alternatives. And as has already been pointed out, they aren't mutually exclusive. I'll keep listening in on the conversation, but I don't have much more to add at this point. I have one desire for the structure of the conversation though. When you quote what someone else has said before you, could you please include who that person was, it makes going back and reading the full context easier. Cheers, -- Tobias Ivarsson <tobias.ivars...@neotechnology.com> Hacker, Neo Technology www.neotechnology.com Cellphone: +46 706 534857 _______________________________________________ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user