Ah, if only it were so... The number of indexable properties (tags) varies on a "per car" basis (e.g. I can add a "driverMood" tag for just a subset of cars), so the domain objects themselves can have a variable number of "tags" and can even be tagged with two values from the same vocabulary (e.g. a car can have two-color paint, red and blue).
The round-robin idea has some merit, but of course, determining the sub-tree width (the number of randomly assigned index subnodes) is somewhat subjective: it means judging what would help address the concurrency issues at the possible expense of traversal performance. Also, the "hotspot" or "supernode" issue exists in a number of other places in our application, wherever we are constantly adding (or removing) content related to an entity in the system. It seems that a lot of the current users of Neo are doing "bulk loads" and using it for analysis, as opposed to using it like an OLTP data store (like we are), so I'm guessing the hotspot issue is unique to our domain. I'm still leaning towards Lucene, but will experiment with a few approaches to see what works best in different scenarios, and will try implementing something along the lines of what you describe.

-----Original Message-----
From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org] On Behalf Of Craig Taverner
Sent: Monday, May 02, 2011 11:29 AM
To: Neo4j user discussions
Subject: Re: [Neo4j] Lucene/Neo Indexing Question

Thinking back to your original domain description, cars with colors, surely you have more properties than just colors to index? If you have two or more properties, then you can use combinations of properties for the first level of the index tree, which provides your logical partitioning of supernodes in a domain-specific way. For example, consider having the four properties color, manufacturer, model, year. The first level of index nodes would be the set of unique combinations of all possible properties (all existing combinations, actually). This set is much larger than the set of colors, so red will occur many times. As a result you dramatically reduce node contention, and the number of relationships per node is much smaller.
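A rough sketch of the composite-key first level described above, written as plain in-memory Python rather than against the Neo4j API. The class name `CompositeIndex`, its methods, and the sample cars are all hypothetical; each dictionary key stands in for one first-level index node, and the query treats any property that is not fixed as a wildcard.

```python
from collections import defaultdict


class CompositeIndex:
    """In-memory stand-in for an index whose first level is keyed by
    combinations of property values (hypothetical, not a Neo4j API)."""

    def __init__(self, properties):
        self.properties = tuple(properties)
        # One entry per *existing* combination of values, i.e. one "index node".
        self.nodes = defaultdict(set)

    def add(self, car_id, car):
        # Missing properties become None, so sparsely tagged cars still fit.
        key = tuple(car.get(p) for p in self.properties)
        self.nodes[key].add(car_id)

    def query(self, **fixed):
        """Ids of cars matching the fixed properties; the rest are wildcards."""
        positions = {self.properties.index(p): v for p, v in fixed.items()}
        result = set()
        for key, ids in self.nodes.items():
            if all(key[i] == v for i, v in positions.items()):
                result |= ids
        return result

    def distinct(self, prop):
        """Distinct values of one property, read from the index keys only."""
        i = self.properties.index(prop)
        return {key[i] for key in self.nodes}


# Illustrative data only.
index = CompositeIndex(["color", "manufacturer", "model", "year"])
index.add(1, {"color": "red", "manufacturer": "Volvo", "model": "V70", "year": 2009})
index.add(2, {"color": "blue", "manufacturer": "Volvo", "model": "V70", "year": 2009})
index.add(3, {"color": "red", "manufacturer": "Saab", "model": "9-3", "year": 2010})
index.add(4, {"color": "red", "manufacturer": "Volvo", "model": "V70", "year": 2009})

red_cars = index.query(color="red")
colors = index.distinct("color")
```

Here "red" is spread over two index entries instead of one, and listing the distinct colors reads the index keys rather than every car. The sketch assumes a single value per property per car; multi-valued tags would need one entry per value combination.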
Then if you want to perform the query for all red cars, your traverser needs to be only slightly more complex, basically 'find all cars with color red and any value of the other properties'.

This is the design of the 'amanzi-index' I started on GitHub in December (but did not complete). It focused on doing queries on multiple properties at the same time, but it does effectively cover your case of reducing node contention, if you can add more properties to the index. It also has the concept of a mapper from the domain-specific property to the index key, which was designed to reduce the number of index nodes, but in your case you could also use it to increase the number of index nodes, using some of the ideas from Jim and Michael. Jim suggested that instead of 'red' always mapping to the same node, it could map to a set of different nodes (randomly selected, or round robin). Michael discussed a distributed hash-code, which I do not fully understand, but it does sound relevant :-)

So, in short, using the design of the amanzi-index you could help this problem in two ways:
- Index together with other properties to get a domain-specific partitioning of the 'supernodes'
- Add a mapper between the color and the index key to get partitioning of the supernodes

On Mon, May 2, 2011 at 1:09 PM, Rick Bullotta <rick.bullo...@thingworx.com> wrote:

> Hi, Michael.
>
> The nature of the domain model really doesn't lend itself to any logical partitioning of "supernodes", so it would indeed have to be something very arbitrary/random.
>
> For now, I think we will have to either deal with the performance issues or switch to using Lucene for the indexing, but we can't do that yet until we have the ability to query the list of terms for a given key (which is a necessary function in our domain model). We could perhaps keep a list of "terms" as nodes *and* index them, but that seems redundant.
>
> Ultimately, I think the solution is to hide the complexity via the indexing framework and to offer a variety of in-graph indexing models that address specific types of domain requirements.
>
> Rick
>
> ________________________________________
> From: user-boun...@lists.neo4j.org [user-boun...@lists.neo4j.org] On Behalf Of Michael Hunger [michael.hun...@neotechnology.com]
> Sent: Monday, May 02, 2011 3:49 AM
> To: Neo4j user discussions
> Subject: Re: [Neo4j] Lucene/Neo Indexing Question
>
> Perhaps then it is sensible to introduce a second layer of nodes, so that you split down your "supernodes" and distribute the write contention?
>
> It would be interesting to see whether putting a round robin on that second level of color nodes would be enough to spread lock contention.
>
> This is what Peter talks about in his activity stream update scenario.
>
> And in general perhaps a step to a more performant in-graph index.
>
> When thinking about in-graph indexes I thought it might be interesting to re-use the HashMap approach: declare x (2^n) bucket nodes, then have relationships from the index-root node to the bucket nodes, with the (re-distributed) hashcode & (x-1) as the relationship types, and below each bucket node, relationships to the concrete nodes with the concrete value as a relationship attribute.
>
> I think this will be addressed even better with Craig's indexes or the Collection abstractions that Andreas Kollegger is working on.
>
> Cheers
>
> Michael
>
> On 02.05.2011 at 12:16, Rick Bullotta wrote:
>
> > Hi, Niels.
> >
> > That's what we're doing now, but it has performance issues with large numbers of relationships when "cars" are constantly being added, since the "color" nodes become synchronization bottlenecks for updates.
> >
> > Rick
> >
> > ________________________________________
> > From: user-boun...@lists.neo4j.org [user-boun...@lists.neo4j.org] On Behalf Of Niels Hoogeveen [pd_aficion...@hotmail.com]
> > Sent: Sunday, May 01, 2011 9:41 AM
> > To: user@lists.neo4j.org
> > Subject: Re: [Neo4j] Lucene/Neo Indexing Question
> >
> > One option would be to create a unique value node for each distinct color and create a relationship from car to that value node. The value nodes can be grouped together with relationships to some reference node.
> >
> > This gives the opportunity of finding all distinct colors, and it allows you to find all cars with that particular color.
> >> Date: Sun, 1 May 2011 14:41:40 +0200
> >> From: matt...@neotechnology.com
> >> To: user@lists.neo4j.org
> >> Subject: Re: [Neo4j] Lucene/Neo Indexing Question
> >>
> >> 2011/4/26 Rick Bullotta <rick.bullo...@thingworx.com>:
> >>> Hi, Mattias.
> >>>
> >>> Here's a use case:
> >>>
> >>> I have a million nodes representing cars, and those nodes are all "tagged" with some value, let's say a color name, as a property. I have indexed those nodes on the color property value. Now I'd like to present a list of the distinct color values with which nodes (cars) have been tagged. At present, I'd need to iterate through all million, read the property, and maintain a "distinct" HashSet as I iterate through them.
> >>>
> >>> I've tried using relationships from the "car" node(s) to a set of "color" node(s), but had scalability/performance issues when there are lots of car nodes being added/deleted (the "color" node quickly becomes a hot spot/synchronization choke point).
> >>
> >> Alright, yeah, such nodes can become bottlenecks, so I see your problem for sure.
> >>>
> >>> Rick
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org] On Behalf Of Mattias Persson
> >>> Sent: Tuesday, April 26, 2011 2:17 PM
> >>> To: Neo4j user discussions
> >>> Subject: Re: [Neo4j] Lucene/Neo Indexing Question
> >>>
> >>> Hi Rick,
> >>>
> >>> No, not really. What's the use case for having such a method?
> >>>
> >>> 2011/4/26 Rick Bullotta <rick.bullo...@thingworx.com>:
> >>>> Hi, all.
> >>>>
> >>>> Is there a method or suggested approach for obtaining a list of all of the distinct key values in a given index? I don't care about the indexed nodes or relationships themselves, just the value(s) of the key.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Rick
> >>>>
> >>>> _______________________________________________
> >>>> Neo4j mailing list
> >>>> User@lists.neo4j.org
> >>>> https://lists.neo4j.org/mailman/listinfo/user
> >>>
> >>> --
> >>> Mattias Persson, [matt...@neotechnology.com]
> >>> Hacker, Neo Technology
> >>> www.neotechnology.com
> >>
> >> --
> >> Mattias Persson, [matt...@neotechnology.com]
> >> Hacker, Neo Technology
> >> www.neotechnology.com
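Michael's bucket-node suggestion quoted earlier in the thread (x = 2^n buckets selected with hashcode & (x-1)) could look roughly like the following, again as plain in-memory Python and not the Neo4j API. One assumption is made loudly here: the bucket is chosen from a hash of the indexed node's id rather than of the value, so that a single hot value such as "red" is spread over all buckets. The class `BucketedIndex` and its method names are hypothetical.

```python
import zlib
from collections import defaultdict


class BucketedIndex:
    """Spread one logical index over 2**n_bits buckets to reduce write
    hot spots (hypothetical sketch, not a Neo4j API)."""

    def __init__(self, n_bits=2):
        self.bucket_count = 1 << n_bits      # x = 2^n buckets
        self.mask = self.bucket_count - 1    # x - 1, used as a bit mask
        # Each bucket maps an indexed value (e.g. "red") to the ids under it.
        self.buckets = [defaultdict(set) for _ in range(self.bucket_count)]

    def bucket_of(self, entity_id):
        # Assumption: hash the entity id (not the value) so one hot value
        # does not collapse into a single bucket. crc32 is deterministic.
        return zlib.crc32(str(entity_id).encode("utf-8")) & self.mask

    def add(self, value, entity_id):
        self.buckets[self.bucket_of(entity_id)][value].add(entity_id)

    def remove(self, value, entity_id):
        bucket = self.buckets[self.bucket_of(entity_id)]
        ids = bucket.get(value)
        if ids is None:
            return
        ids.discard(entity_id)
        if not ids:
            del bucket[value]  # drop empty entries so distinct_values stays accurate

    def lookup(self, value):
        # A read visits every bucket: the cost paid for spreading the writes.
        found = set()
        for bucket in self.buckets:
            found |= bucket.get(value, set())
        return found

    def distinct_values(self):
        return set().union(*(bucket.keys() for bucket in self.buckets))
```

Writes for the same value now land on up to x different buckets, while a lookup or a distinct-values listing has to touch all x of them, which is the width versus traversal-cost trade-off raised at the top of this message.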