> Good catch, forgot to add the in-graph representation of the results to my
> mail, thanks for adding that part. Temporary (transient) nodes and
> relationships would really rock here, with the advantage that with HA you
> have them distributed to all cluster nodes.
> Certainly Craig has to add some interesting things to this, as those
> probably resemble his in-graph indexes / R-trees.
I certainly make use of this model, much more so for my statistical analysis than for graph indexes (though I'm planning to merge indexes and statistics). In my case, however, the structures are currently very domain-specific. But I think the idea is sound and should be generalizable.

What I do is have a concept of a 'dataset' on which queries can be performed. The dataset is usually the root of a large sub-graph. The query parser (domain-specific) creates a hashcode of the query and checks whether the dataset node already has a resultset (a connected sub-graph with its own root node containing the previous query hashcode). If so, it returns that (traverses it); otherwise it performs the complete dataset traversal, creating the resultset as a new sub-graph, and then returns it. This works especially well for statistical queries, where the resultset is much smaller than the dataset: adding new sub-graphs has little impact on the database size, and the resultset is much faster to return, so this is a performance enhancement for repeated requests from the client.

Also, I keep the resultset permanently, not temporarily. Very few operations modify the dataset, and when they do, we delete all resultsets; they get re-created the next time they are needed. My planned work on merging the indexes with the statistics will also only recreate 'dirty' subsets of the resultset, so modifying the dataset has minimal impact on query performance.

After reading Rick's previous email I started thinking of approaches to generalizing this, but I think your 'transient' nodes perhaps encompass everything I thought about. Here is an idea:

- Have new node/relationship/property tables on disk, like a second graph database, but different in the sense that it has one-way relations into the main database, which cannot be seen by the main graph and so are by definition not part of the graph. These can have transience and expiry characteristics.
Then we can build the resultset graphs as transient graphs in the transient database, with 'drill-down' capabilities into the original graph (something I find I always need for statistical queries, and something a graph is simply much better at than a relational database).

- Use some kind of hashcode in the traversal definition or query to identify existing, cached, transient graphs in the second database, so you can rely on those for repeated queries, pagination, streaming, etc.

> As traversers are lazy a count operation is not so easily possible; you
> could run the traversal and discard the results. But then the client could
> also just pull those results until it reaches its internal thresholds and
> then decide to use more filtering, or stop the pulling and ask the user for
> more filtering (you can always retrieve n+1 and show the user that there
> are more than "n" results available).

Yes. Count needs to perform the traversal, so the only way to avoid traversing twice is to keep a cache. If we make the cache a transient sub-graph (possibly in the second database I described above), then we get the interesting behaviour that count() takes a while, but subsequent queries, pagination or streaming are fast.

> Please don't forget that a count() query in an RDBMS can be as ridiculously
> expensive as the original query (especially if just the column selection
> was replaced with count, and sorting, grouping etc. were still left in
> place together with lots of joins).

Good to hear they have the same problem as us :-) (or even more problems)

> Sorting on your own instead of letting the db do that mostly harms
> performance, as it requires you to build up all the data in memory, sort it
> and then use it. Instead of having the db do that more efficiently, stream
> the data and you can use it directly from the stream.
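To make the caching scheme above concrete, here is a minimal, database-agnostic sketch of the idea: hash the query, look for an existing resultset hanging off the dataset root, and only run the full traversal on a miss. All names here (DatasetNode, query, modify) are illustrative stand-ins, not Neo4j API, and a plain list stands in for the dataset sub-graph:

```python
# Sketch of the query-hash resultset cache described above. A real
# implementation would store resultsets as connected sub-graphs; here a
# dict keyed by the query hash stands in for those sub-graphs.

import hashlib

class DatasetNode:
    """Root of a large 'dataset' plus cached resultsets keyed by query hash."""

    def __init__(self, records):
        self.records = records      # stand-in for the dataset sub-graph
        self.resultsets = {}        # query-hash -> cached resultset

    def query(self, query_text, predicate):
        key = hashlib.sha1(query_text.encode()).hexdigest()
        if key in self.resultsets:  # cache hit: traverse the stored resultset
            return self.resultsets[key]
        result = [r for r in self.records if predicate(r)]  # full traversal
        self.resultsets[key] = result   # keep it permanently, not temporarily
        return result

    def modify(self, record):
        # Very few operations modify the dataset; when one does, delete all
        # resultsets and let them be re-created lazily on the next query.
        self.records.append(record)
        self.resultsets.clear()

ds = DatasetNode(list(range(10)))
evens = ds.query("even", lambda r: r % 2 == 0)
assert evens == [0, 2, 4, 6, 8]
ds.modify(10)
assert ds.resultsets == {}          # cache invalidated by the modification
```

The 'dirty subset' refinement mentioned above would replace the blanket `clear()` with invalidation of only the affected resultsets.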
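The "retrieve n+1" trick quoted above can be sketched in a few lines: pull one more result than the page size from the lazy traversal, so the client can show a "more results available" indicator without a full count(). The `page` helper is hypothetical, and any iterator stands in for a lazy traverser:

```python
# Pull at most n+1 items from a lazy result stream; the extra item only
# tells us whether more results exist beyond this page.

from itertools import islice

def page(lazy_results, n):
    """Return (first n results, has_more flag) from any iterator."""
    chunk = list(islice(lazy_results, n + 1))
    return chunk[:n], len(chunk) > n

results, has_more = page(iter(range(100)), 10)  # 100 hits, page size 10
assert results == list(range(10))
assert has_more                                 # more than 10 available
```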
Client-side sorting makes sense if you know the domain well enough to know, for example, that you will receive a result set small enough to 'fit' in the client, and you want to give the user multiple interactive sort options without hitting the database again. But I agree that in general it makes sense to let the database do the sort.

Cheers, Craig

_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user