>
> Good catch, I forgot to add the in-graph representation of the results to
> my mail, thanks for adding that part. Temporary (transient) nodes and
> relationships would really rock here, with the advantage that with HA you
> have them distributed to all cluster nodes.
> Certainly Craig will have some interesting things to add to this, as those
> probably resemble his in-graph indexes / R-Trees.
>

I certainly make use of this model, much more so for my statistical analysis
than for graph indexes (but I'm planning to merge indexes and statistics).

However, in my case the structures are currently very domain specific, but I
think the idea is sound and should be generalizable. What I do is have a
concept of a 'dataset' on which queries can be performed. The dataset is
usually the root of a large sub-graph. The query parser (domain specific)
creates a hashcode of the query and checks whether the dataset node already
has a resultset (a connected sub-graph with its own root node containing the
previous query hashcode). If so, it returns that (traverses it); otherwise
it performs the complete dataset traversal, creating the resultset as a new
sub-graph, and then returns it.

This works particularly well for statistical queries, where the resultset is
much smaller than the dataset: adding new sub-graphs has a small impact on
the database size, and the resultset is much faster to return, so this is a
performance enhancement for repeated requests from the client. Also, I keep
the resultset permanently, not temporarily. Very few operations modify the
dataset, and when they do, we delete all resultsets and they get re-created
on the next query. My planned work on merging the indexes with the
statistics will also recreate only the 'dirty' subsets of the resultset, so
modifying the dataset has minimal impact on query performance.
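
To make the pattern concrete, here is a minimal sketch against the embedded
Java API. The RESULT_SET relationship type, the 'queryHash' property and
buildResultSet() are all hypothetical names; the real code is domain
specific:

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;

public class CachedQueryExample
{
    // Hypothetical relationship type linking a dataset root to its cached
    // resultset sub-graphs
    enum RelTypes implements RelationshipType { RESULT_SET }

    private final GraphDatabaseService graphDb;

    public CachedQueryExample( GraphDatabaseService graphDb )
    {
        this.graphDb = graphDb;
    }

    // Return the root of a cached resultset for this query, building it on
    // a cache miss
    public Node getResultSet( Node dataset, String query )
    {
        int hash = query.hashCode();
        Transaction tx = graphDb.beginTx();
        try
        {
            // Cache hit: an existing resultset sub-graph keyed by the hash
            for ( Relationship rel : dataset.getRelationships(
                    RelTypes.RESULT_SET, Direction.OUTGOING ) )
            {
                Node root = rel.getEndNode();
                if ( Integer.valueOf( hash ).equals(
                        root.getProperty( "queryHash", null ) ) )
                {
                    tx.success();
                    return root; // traverse this small sub-graph instead
                }
            }
            // Cache miss: run the full dataset traversal and persist the
            // results as a new sub-graph
            Node root = graphDb.createNode();
            root.setProperty( "queryHash", hash );
            dataset.createRelationshipTo( root, RelTypes.RESULT_SET );
            buildResultSet( dataset, query, root ); // domain specific
            tx.success();
            return root;
        }
        finally
        {
            tx.finish();
        }
    }

    private void buildResultSet( Node dataset, String query, Node resultRoot )
    {
        // The expensive, domain-specific traversal over the dataset;
        // attaches the (much smaller) results under resultRoot.
    }
}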

After reading Rick's previous email I started thinking of approaches to
generalizing this, but I think your 'transient' nodes perhaps encompass
everything I thought about. Here is an idea:

   - Have new nodes/relationships/properties tables on disk, like a second
   graph database, but different in the sense that it has one-way relations
   into the main database; these cannot be seen from the main graph and so
   are by definition not part of it. They can have transience and expiry
   characteristics. Then we can build the resultset graphs as transient
   graphs in the transient database, with 'drill-down' capabilities back to
   the original graph (something I find I always need for statistical
   queries, and something a graph is simply much better at than a relational
   database); see the sketch after this list.
   - Use some kind of hashcode in the traversal definition or query to
   identify existing, cached, transient graphs in the second database, so
   you can rely on those for repeated queries, pagination, streaming, etc.
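
As a rough sketch of the one-way reference in the first point, assuming two
independent embedded instances (all names here, like createTransientRef and
'mainNodeId', are hypothetical):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

public class OneWayReference
{
    // Create a node in the transient database that points (one way) at a
    // node in the main graph. The link is just a node-id property, so the
    // main graph never sees it.
    public static Node createTransientRef( GraphDatabaseService transientDb,
            Node mainGraphNode, long ttlMillis )
    {
        Transaction tx = transientDb.beginTx();
        try
        {
            Node ref = transientDb.createNode();
            ref.setProperty( "mainNodeId", mainGraphNode.getId() );
            ref.setProperty( "expiryTime",
                    System.currentTimeMillis() + ttlMillis );
            tx.success();
            return ref;
        }
        finally
        {
            tx.finish();
        }
    }

    // Drill-down: resolve the one-way reference back into the main graph
    public static Node drillDown( GraphDatabaseService mainDb,
            Node transientRef )
    {
        return mainDb.getNodeById(
                (Long) transientRef.getProperty( "mainNodeId" ) );
    }
}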

> As traversers are lazy, a count operation is not so easily possible; you
> could run the traversal and discard the results. But then the client could
> also just pull those results until it reaches its internal thresholds, and
> then decide to use more filtering or stop the pulling and ask the user for
> more filtering (you can always retrieve n+1 and show the user that there
> are more than "n" results available).
>

Yes. Count needs to perform the traversal, so the only way to avoid
traversing twice is to keep a cache. If we make the cache a transient
sub-graph (possibly in the second database I described above), then we get
the interesting behaviour that count() takes a while, but subsequent
queries, pagination or streaming are fast.
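
For example, building on the getResultSet() sketch above (the 'count'
property is again a hypothetical name):

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;

public class CachedCount
{
    // First call walks the cached resultset sub-graph once and records its
    // size; later calls (and pagination) are a single property read.
    public static long count( GraphDatabaseService graphDb, Node resultRoot )
    {
        if ( !resultRoot.hasProperty( "count" ) )
        {
            Transaction tx = graphDb.beginTx();
            try
            {
                long n = 0;
                for ( Relationship rel : resultRoot.getRelationships(
                        Direction.OUTGOING ) )
                {
                    n++;
                }
                resultRoot.setProperty( "count", n );
                tx.success();
            }
            finally
            {
                tx.finish();
            }
        }
        return (Long) resultRoot.getProperty( "count" );
    }
}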

> Please don't forget that a count() query in an RDBMS can be as ridiculously
> expensive as the original query (especially if just the column selection
> was replaced with count, and sorting, grouping etc. were still left in
> place together with lots of joins).
>

Good to hear they have the same problem as us :-)
(or even more problems)

> Sorting on your own instead of letting the db do it mostly harms
> performance, as it requires you to build up all the data in memory, sort
> it and then use it. Instead, have the db do that more efficiently and
> stream the data, so you can use it directly from the stream.
>

Client-side sorting makes sense if you know the domain well enough to know,
for example, that you will receive a result set small enough to 'fit' in the
client, and you want to give the user multiple interactive sort options
without hitting the database again. But I agree that in general it makes
sense to get the database to do the sort.
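
For example (assuming the rows were already fetched once, and hypothetical
column names):

import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class ClientSideSort
{
    // The rows were fetched once and are small enough to hold in the
    // client, so the user can flip between sort orders with no further
    // database round trips.
    public static void sortByName( List<Map<String, Object>> rows )
    {
        rows.sort( Comparator.comparing(
                ( Map<String, Object> r ) -> (String) r.get( "name" ) ) );
    }

    public static void sortByCount( List<Map<String, Object>> rows )
    {
        rows.sort( Comparator.comparing(
                ( Map<String, Object> r ) -> (Long) r.get( "count" ) ) );
    }
}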

Cheers, Craig