On Thu, Apr 21, 2011 at 11:18 PM, Michael Hunger <
michael.hun...@neotechnology.com> wrote:

> Rick,
>
> great thoughts.
>
> Good catch, forgot to add the in-graph representation of the results to my
> mail, thanks for adding that part. Temporary (transient) nodes and
> relationships would really rock here, with the advantage that with HA you
> have them distributed to all cluster nodes.
> Certainly Craig has to add some interesting things to this, as those
> resemble probably his in graph indexes / R-Trees.
>
> As traversers are lazy a count operation is not so easily possible, you
> could run the traversal and discard the results. But then the client could
> also just pull those results until it reaches its
> internal tresholds and then decide to use more filtering or stop the
> pulling and ask the user for more filtering (you can always retrieve n+1 and
> show the user that there are more that "n" results available).
>
> The index result size() method only returns an estimate of the result size
> (which might not contain currently changed index entries).
>
> Please don't forget that a count() query in a RDBMS can be as ridicully
> expensive as the original query (especially if just the column selection was
> replaced with count, and sorting, grouping etc was still left in place
> together with lots of joins).
>
> Sorting on your own instead of letting the db do that mostly harms the
> performance as it requires you to build up all the data in memory, sort it
> and then use it. Instead of having the db do that more efficiently, stream
> the data and you can use it directly from the stream.
>

throw new SlapOnTheFingersException("sometimes the application developer can
do a better job since she has better knowledge of the data, the database
only has generic knowledge");

Since Jake had already mentioned (in this very thread) that he expected one
of those, I thought I might as well throw one in there.

I agree with the analysis of count(), as the name ("count") implies, it will
have to run the entire query in order to count the number of resulting
items.

About sorting I'm torn. The perception of sorting in the database being slow
that Rick points to is one that I've seen a lot. When you hand the
responsibility of sorting to the database you hide the fact that sorting is
an expensive operation, it does require reading in all data in order to sort
it. People often expect databases to be "smarter" than that, since they
sometimes are, but that is pretty much only when reading straight from an
index and not doing much more. A generic sort of data can never be better
than O(log(n!)) [O(log(n!)) is almost equal to, and commonly rounded to the
easier to compute function O(n log(n))]. If you put the responsibility of
sorting in the hands of the application you can sometimes utilize knowledge
about the data to do a more efficient sorting than the database could have
done. Most often by simply doing an application level filtering of the data
before sorting it, based on some filtering that could not be transfered to
the database query. This does however make the work of the application
developer slightly more tedious, which is why I think it would be sensible
to have support for sorting on the database level, and hope that users will
be sensible about using it, and not assume magic from it.

Something I find very interesting is the concept of semi-sorted data.
Semi-sorted data is often good enough, easier to achieve, and quite easy to
then sort completely if that is required. Examples of semi-sorted data could
be data in an order that satisfies the heap property. Or for spatial queries
returning the closest hits first, but not necessarily in perfect order, say
returning the hits within a miles radius first, before the ones in a radius
between 1-10 miles, and so on, without requiring the hits in each 'segment'
to be perfectly ordered by distance. Breadth first order is another example
of semi-sorted data, that could be used when traversing data as you've
outlined with "paging nodes", or similarly "grouped by parent node"-order.

I must say that I really enjoy following this discussion. I really like the
idea of streaming, since I think that can be implemented more easily than
paging, while satisfying many of the desired use cases. But I still want to
hear more arguments for and against both alternatives. And as has already
been pointed out, they aren't mutually exclusive.

I'll keep listening in on the conversation, but I don't have much more to
add at this point. I have one desire for the structure of the conversation
though. When you quote what someone else has said before you, could you
please include who that person was, it makes going back and reading the full
context easier.

Cheers,
-- 
Tobias Ivarsson <tobias.ivars...@neotechnology.com>
Hacker, Neo Technology
www.neotechnology.com
Cellphone: +46 706 534857
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user

Reply via email to