Awesome inputs, Craig. That's very similar to how we were considering it - the indexing part is the piece that I'm having a tough time getting my head around. If I had a timeline-oriented index, I'd need to:
a) Find the start point (either start time for a forward-oriented traversal or end time for a backcward-oriented traversal) b) Traverse in time order I'll take a look at the timeline example to see if that can do what we need. Because we have multiple types of log entries, we were planning to have two nodes per entry - one for the "common" properties (timestamp, source, category, description, etc.) and one for the "content" (event specific properties) with simple a relationship between them. In some cases, we might even have multiple event entries for one "common" node (e.g. a multi-valued sensor). Neo deals with this nicely for us. We were then planning to maintain relationships from the common properties (which are what will generally be most effective for quickly reducing the size of the set of nodes from many millions to a few thousand) to the source/category nodes they refer to. Sort of like a map/reduce scenario updated as nodes are created. We will probably then have a Hadoop or other mechanism to "crawl" and do a map/reduce on the event-specific content as needed to maintain indexes/relationships for specific queries. Lastly, we built an engine in Java to deal with SQL-like filtering, sorting, and aggregation of the event node data once we have the set of possible candidates based on time range, source, and category. I am confident we'll find a way to strike a balance between inbound performance (cost of setting up the relationships and indexes as nodes/events are added) versus query performance. Thanks again for sharing your experiences! Rick -----Original Message----- From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org] On Behalf Of Craig Taverner Sent: Thursday, December 10, 2009 10:51 AM To: Neo user discussions Subject: Re: [Neo] Noob questions/comments I'd like to comment on the question of indexing event logs. We are also dealing with event logs and we query them based on time windows as well as other properties. In our case they are also located on a map, and categorized using event types. So we have three types of indexes in use: - time index (using long values, similar to the TimelineIndex Johan mentioned, but we coded one ourselves) - spatial index (using two double values, based on the same mechanism for the time index, but in 2D) - category index (we just create a list of category nodes and link the events to the relevant category) All of these indexes are simply nodes that the event stream nodes link to. For numerical indexes like the time and spatial indexes we use tree structures (not B-tree, but usually multiple children per-parent, suitable for uneven density data). The first level of the tree (closest to the data) is chosen with a resolution close to common queries (in our case the events occur many times per second, but queries are usually at second resolution, so the first index is of second resolution). You had a very important question about combined indexes, for example querying on timestamp and category in the same query with high performance. Currently we do not have need for that in our system, but we have brainstormed a nice solution to this, so I thought I'd mention it here in case it is useful. There are two options: - If one of the criteria is very limiting all the time, for example querying on time-window always returns a small set, then query that first and do a slow search for the linked categories. This adds no additional complexity to the database, but makes assumptions about the queries, and only performs well if these assumptions are true. - Otherwise you can build an combined index by connecting the tree nodes of one index to the tree nodes of another. In the case of time and category indices, each of the nodes in the B-tree or multi-tree time index would be connected to all the categories for which its underlying data nodes belong. Then when traversing the time tree, you can test for both the time-window constraints and the category constraints, and exit the search if either fail. We have considered the possibility of building these structures on demand, based on actual queries, so the first query that works with any two constraints would search on one, and then build the combined index for both. This allows subsequent searches to run very fast, without needing to build all possible combinations of combined index (assuming many single property indices exist). - One aspect of our application will store nodes that can be > considered similar to event logs. There may be many thousands of these > nodes per "event stream". We would like to be able to traverse the entries > in chronological order, very quickly. We were considering the following > design possibilities: > > o Simply create a node for each "stream" and a node for each entry, with > a > relationship between the stream and the entry, then implement our own sort > routine > Our approach is to create a node for each entry, and index using time and spatial indices. The first level of index is another stream of data, ordered by the relevant property, and traversable in that order (eg. time order). o Similar to the above, but create a node for each "day", and manage > relationships to allow traversal by stream and/or day > In our approach, each level in the index tree represents a higher level of granularity. We go up in fixed steps (multiples). A B-tree steps 2X. We tend to step 10X, because that gives isosceles pyramid trees. But you might prefer to step in known temporal quantities, seconds, minutes, hours, days, weeks, months, etc. That will improve search performance if your common queries are exact multiples of the different index levels. > o Create a node for each stream, a node for each entry and treat the > entries as a forward-only linked list using relationships between the > entries (and of course a relationship between the stream and the "first" > entry) > We tend to create relationships for all common query or traversal paths, with different relationship types in all cases. So traversing the original data would use 'next'. Traversing down the index would be 'child', or perhaps 'index-child' if there is ambiguity. etc. - Anyone used any kind of intermediate index or other approach to > bridge multiple Neo instances? > Hmm... I think it was this question that got me started on the combined index discussion above, but now that I re-read it, I see it has nothing to do with combined indices. I've thought a bit about bridging indices, but have nothing really useful to offer here. Sorry. Hope the long discussion above still has some value :-( _______________________________________________ Neo mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user _______________________________________________ Neo mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user