Since no one responded yesterday, I wanted to re-emphasize that there are probably substantial optimizations that can be made in a well-known problem domain such as this. For example, by using pre-calculated "relevance" measures for tags, and by narrowing the returned set of posts/nodes as rapidly as possible using the "least used" tag(s) in progressive order. It would be quite trivial (and reasonably performant) to maintain a pair of properties on each node in the tag hierarchy that count the # of relationships of the tag and all its children. Each time a tagging relationship was added to a post, simply add 1 to this property for the tag node and all its ancestors/parents. Then, when you are provided with a list of tags to search upon, order them by the least frequently used tag by leveraging this metric and execute your traversals/set analysis in that order. I also think my proposal for a two-directional search (first one from the direction of the least frequently used tag to the posts that include it, followed by a search from each of those posts back to its tags as described in a previous message) could be quite fast.
Another "compound index" approach that can be used, which is somewhat of a brute force method, is to maintain a property on each tag node that consists of its "aggregrate name" - e.g. Europe.Italy.Toscana.Siena or Activities.Active.Cycling.MountainBiking. When doing a search for Cycling activities in Italy, you could grab the aggregate names for Italy (Europe.Italy) and Cycling (Activities.Active.Cycling), then, using whatever mechanism you choose for your initial node traversal (exhaustive or least-frequently-used tag), you can compare the aggregate name for the tags assigned to a post node to the aggregate names for the desired nodes using a simple String.startsWith(). For example, if I posted regarding a mountain bike ride I took in the hills around Siena, tagged with the above aggregate names, it would successfully match. My first thought was that this could be problematic if a tag term appeared multiple times in the tag hierarchy, but that's easily managed on the query side. Just trying to make the point that sometime abstract or generic traversal schemes aren't always optimal and that it is often worth the effor to explore domain-specific approaches. Does that make any sense? -------- Original Message -------- Subject: Re: [Neo] How to efficiently query in Neo4J? From: Craig Taverner <cr...@amanzi.com> Date: Wed, April 07, 2010 7:05 pm To: Neo user discussions <user@lists.neo4j.org> Hi Alastair, I have been using what you tag the 'composite index' although in mysql. Its > fast, but a pain to manage (as you need to keep the index up to date), so I > would like to stay away from indexes *if possible*. > I would think that you only need to take action when you add or modify a node, and then only to (re)connect it to the index tree (creating index nodes on demand, if missing). This can be embedded in your domain classes, so indexing is automatic. You can even synchronously 'garbage-collect' unused index nodes (if the node unlinked was the last node for that index node). I think the index-service for this needs to be well tested for all scenarios, but should ultimately have a very simple API, with no manual management requirement. My one concern with the composite index for your case is that all my thinking in this has been for numerical indexes, where I plan to query with inequalities (eg. return all restaurants with rating >= 4 stars). I've not thought about how to solve hierarchical tags like you have. One further optimisation is to only store new items in the hash on the first > traversal. Then, in the subsequent traversals, if the key does not exist, > there is no need to add key with count 1, as it cannot ever be emitted. > This > limits the memory requirements to the order of the first traversal, so if > you pick that well, it should be better. > Nice idea. It makes your approach more like the 'one set intersection' approach in term of memory. Picking a good first query seems a common need for many of the solutions. I presume RDBMS have a query optimization phase that figures that out. I'm hoping to completely avoid that kind of non-deterministic approach with the composite index. Cheers, Craig _______________________________________________ Neo mailing list User@lists.neo4j.org [1]https://lists.neo4j.org/mailman/listinfo/user References 1. https://lists.neo4j.org/mailman/listinfo/user _______________________________________________ Neo mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user