Awesome inputs, Craig.

That's very similar to how we were considering it - the indexing part is the
piece that I'm having a tough time getting my head around.  If I had a
timeline-oriented index, I'd need to:

a) Find the start point (either start time for a forward-oriented traversal
or end time for a backcward-oriented traversal)

b) Traverse in time order

I'll take a look at the timeline example to see if that can do what we need.

Because we have multiple types of log entries, we were planning to have two
nodes per entry - one for the "common" properties (timestamp, source,
category, description, etc.) and one for the "content" (event specific
properties) with simple a relationship between them.  In some cases, we
might even have multiple event entries for one "common" node (e.g. a
multi-valued sensor).  Neo deals with this nicely for us.

We were then planning to maintain relationships from the common properties
(which are what will generally be most effective for quickly reducing the
size of the set of nodes from many millions to a few thousand) to the
source/category nodes they refer to.  Sort of like a map/reduce scenario
updated as nodes are created.

We will probably then have a Hadoop or other mechanism to "crawl" and do a
map/reduce on the event-specific content as needed to maintain
indexes/relationships for specific queries.

Lastly, we built an engine in Java to deal with SQL-like filtering, sorting,
and aggregation of the event node data once we have the set of possible
candidates based on time range, source, and category.

I am confident we'll find a way to strike a balance between inbound
performance (cost of setting up the relationships and indexes as
nodes/events are added) versus query performance.

Thanks again for sharing your experiences!

Rick

-----Original Message-----
From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org] On
Behalf Of Craig Taverner
Sent: Thursday, December 10, 2009 10:51 AM
To: Neo user discussions
Subject: Re: [Neo] Noob questions/comments

I'd like to comment on the question of indexing event logs. We are also
dealing with event logs and we query them based on time windows as well as
other properties. In our case they are also located on a map, and
categorized using event types. So we have three types of indexes in use:

   - time index (using long values, similar to the TimelineIndex Johan
   mentioned, but we coded one ourselves)
   - spatial index (using two double values, based on the same mechanism for
   the time index, but in 2D)
   - category index (we just create a list of category nodes and link the
   events to the relevant category)

All of these indexes are simply nodes that the event stream nodes link to.
For numerical indexes like the time and spatial indexes we use tree
structures (not B-tree, but usually multiple children per-parent, suitable
for uneven density data). The first level of the tree (closest to the data)
is chosen with a resolution close to common queries (in our case the events
occur many times per second, but queries are usually at second resolution,
so the first index is of second resolution).

You had a very important question about combined indexes, for example
querying on timestamp and category in the same query with high performance.
Currently we do not have need for that in our system, but we have
brainstormed a nice solution to this, so I thought I'd mention it here in
case it is useful. There are two options:

   - If one of the criteria is very limiting all the time, for example
   querying on time-window always returns a small set, then query that first
   and do a slow search for the linked categories. This adds no additional
   complexity to the database, but makes assumptions about the queries, and
   only performs well if these assumptions are true.
   - Otherwise you can build an combined index by connecting the tree nodes
   of one index to the tree nodes of another. In the case of time and
category
   indices, each of the nodes in the B-tree or multi-tree time index would
be
   connected to all the categories for which its underlying data nodes
belong.
   Then when traversing the time tree, you can test for both the time-window
   constraints and the category constraints, and exit the search if either
   fail. We have considered the possibility of building these structures on
   demand, based on actual queries, so the first query that works with any
two
   constraints would search on one, and then build the combined index for
both.
   This allows subsequent searches to run very fast, without needing to
build
   all possible combinations of combined index (assuming many single
property
   indices exist).

-          One aspect of our application will store nodes that can be
> considered similar to event logs.  There may be many thousands of these
> nodes per "event stream".  We would like to be able to traverse the
entries
> in chronological order, very quickly.  We were considering the following
> design possibilities:
>
> o   Simply create a node for each "stream" and a node for each entry, with
> a
> relationship between the stream and the entry, then implement our own sort
> routine
>

Our approach is to create a node for each entry, and index using time and
spatial indices. The first level of index is another stream of data, ordered
by the relevant property, and traversable in that order (eg. time order).

o   Similar to the above, but create a node for each "day", and manage
> relationships to allow traversal by stream and/or day
>

In our approach, each level in the index tree represents a higher level of
granularity. We go up in fixed steps (multiples). A B-tree steps 2X. We tend
to step 10X, because that gives isosceles pyramid trees. But you might
prefer to step in known temporal quantities, seconds, minutes, hours, days,
weeks, months, etc. That will improve search performance if your common
queries are exact multiples of the different index levels.


> o   Create a node for each stream, a node for each entry and treat the
> entries as a forward-only linked list using relationships between the
> entries (and of course a relationship between the stream and the "first"
> entry)
>

We tend to create relationships for all common query or traversal paths,
with different relationship types in all cases. So traversing the original
data would use 'next'. Traversing down the index would be 'child', or
perhaps 'index-child' if there is ambiguity. etc.

-          Anyone used any kind of intermediate index or other approach to
> bridge multiple Neo instances?
>

Hmm... I think it was this question that got me started on the combined
index discussion above, but now that I re-read it, I see it has nothing to
do with combined indices. I've thought a bit about bridging indices, but
have nothing really useful to offer here. Sorry. Hope the long discussion
above still has some value :-(
_______________________________________________
Neo mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user

_______________________________________________
Neo mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user

Reply via email to