Re: [Neo] Noob questions/comments

Craig Taverner Thu, 10 Dec 2009 08:49:39 -0800

Hi Nick,

Your system sounds super-cool. Must be fun and challenging to deal with the
clustering issues as well. We have though a little about this for future
projects but have no immediate needs.


I have some further comments regarding the balance you need to strike
between index search speed and the need to update the index as the data
streams in. We also have that need, which is probably the main reason I
still have my own home-grown index. Unlike most 'official' indexing
algorithms, I build my index from the ground up. This means the index comes
into existence as data is loaded, and only areas containing data actually
include index nodes. So, for example, if we have data coming in every few
milliseconds for a while, then there is a big gap of days or weeks, then the
data flows again, the gap will not contain any index information. The tree
also only grows as high as needed to cover the full range, so if only a
small range of data is indexed, the complete tree is small, and the top node
only covers that range. This bottom-up approach is very useful for streaming
data, since the index is driven by the stream. It also performs well if the
data coming in is already to some extent in the right order, because finding
the right index nodes to attach too is fast (often still in memory). Our
data streams are produced by people driving along roads, so the time and
location indexes are fast to keep up to date (each event is close to the
previous event in time and space).

I've investigated a few other indexing algorithms, but usually they are
designed to cope with bulk indexing data that is entirely un-ordered, and
that 're-index' step is usually quite costly.

I'm not using neo4j's own timeline index, but my quick glance at it makes it
look like it probably also works for streaming data.

We will continue to look for improvements to the balancing act, and we
should keep in touch on that question.

Regards, Craig

On Thu, Dec 10, 2009 at 5:02 PM, Rick Bullotta <
rick.bullo...@burningskysoftware.com> wrote:

> Awesome inputs, Craig.
>
> That's very similar to how we were considering it - the indexing part is
> the
> piece that I'm having a tough time getting my head around.  If I had a
> timeline-oriented index, I'd need to:
>
> a) Find the start point (either start time for a forward-oriented traversal
> or end time for a backcward-oriented traversal)
>
> b) Traverse in time order
>
> I'll take a look at the timeline example to see if that can do what we
> need.
>
> Because we have multiple types of log entries, we were planning to have two
> nodes per entry - one for the "common" properties (timestamp, source,
> category, description, etc.) and one for the "content" (event specific
> properties) with simple a relationship between them.  In some cases, we
> might even have multiple event entries for one "common" node (e.g. a
> multi-valued sensor).  Neo deals with this nicely for us.
>
> We were then planning to maintain relationships from the common properties
> (which are what will generally be most effective for quickly reducing the
> size of the set of nodes from many millions to a few thousand) to the
> source/category nodes they refer to.  Sort of like a map/reduce scenario
> updated as nodes are created.
>
> We will probably then have a Hadoop or other mechanism to "crawl" and do a
> map/reduce on the event-specific content as needed to maintain
> indexes/relationships for specific queries.
>
> Lastly, we built an engine in Java to deal with SQL-like filtering,
> sorting,
> and aggregation of the event node data once we have the set of possible
> candidates based on time range, source, and category.
>
> I am confident we'll find a way to strike a balance between inbound
> performance (cost of setting up the relationships and indexes as
> nodes/events are added) versus query performance.
>
> Thanks again for sharing your experiences!
>
> Rick
>
> -----Original Message-----
> From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org]
> On
> Behalf Of Craig Taverner
> Sent: Thursday, December 10, 2009 10:51 AM
> To: Neo user discussions
> Subject: Re: [Neo] Noob questions/comments
>
> I'd like to comment on the question of indexing event logs. We are also
> dealing with event logs and we query them based on time windows as well as
> other properties. In our case they are also located on a map, and
> categorized using event types. So we have three types of indexes in use:
>
>   - time index (using long values, similar to the TimelineIndex Johan
>   mentioned, but we coded one ourselves)
>   - spatial index (using two double values, based on the same mechanism for
>   the time index, but in 2D)
>   - category index (we just create a list of category nodes and link the
>   events to the relevant category)
>
> All of these indexes are simply nodes that the event stream nodes link to.
> For numerical indexes like the time and spatial indexes we use tree
> structures (not B-tree, but usually multiple children per-parent, suitable
> for uneven density data). The first level of the tree (closest to the data)
> is chosen with a resolution close to common queries (in our case the events
> occur many times per second, but queries are usually at second resolution,
> so the first index is of second resolution).
>
> You had a very important question about combined indexes, for example
> querying on timestamp and category in the same query with high performance.
> Currently we do not have need for that in our system, but we have
> brainstormed a nice solution to this, so I thought I'd mention it here in
> case it is useful. There are two options:
>
>   - If one of the criteria is very limiting all the time, for example
>   querying on time-window always returns a small set, then query that first
>   and do a slow search for the linked categories. This adds no additional
>   complexity to the database, but makes assumptions about the queries, and
>   only performs well if these assumptions are true.
>   - Otherwise you can build an combined index by connecting the tree nodes
>   of one index to the tree nodes of another. In the case of time and
> category
>   indices, each of the nodes in the B-tree or multi-tree time index would
> be
>   connected to all the categories for which its underlying data nodes
> belong.
>   Then when traversing the time tree, you can test for both the time-window
>   constraints and the category constraints, and exit the search if either
>   fail. We have considered the possibility of building these structures on
>   demand, based on actual queries, so the first query that works with any
> two
>   constraints would search on one, and then build the combined index for
> both.
>   This allows subsequent searches to run very fast, without needing to
> build
>   all possible combinations of combined index (assuming many single
> property
>   indices exist).
>
> -          One aspect of our application will store nodes that can be
> > considered similar to event logs.  There may be many thousands of these
> > nodes per "event stream".  We would like to be able to traverse the
> entries
> > in chronological order, very quickly.  We were considering the following
> > design possibilities:
> >
> > o   Simply create a node for each "stream" and a node for each entry,
> with
> > a
> > relationship between the stream and the entry, then implement our own
> sort
> > routine
> >
>
> Our approach is to create a node for each entry, and index using time and
> spatial indices. The first level of index is another stream of data,
> ordered
> by the relevant property, and traversable in that order (eg. time order).
>
> o   Similar to the above, but create a node for each "day", and manage
> > relationships to allow traversal by stream and/or day
> >
>
> In our approach, each level in the index tree represents a higher level of
> granularity. We go up in fixed steps (multiples). A B-tree steps 2X. We
> tend
> to step 10X, because that gives isosceles pyramid trees. But you might
> prefer to step in known temporal quantities, seconds, minutes, hours, days,
> weeks, months, etc. That will improve search performance if your common
> queries are exact multiples of the different index levels.
>
>
> > o   Create a node for each stream, a node for each entry and treat the
> > entries as a forward-only linked list using relationships between the
> > entries (and of course a relationship between the stream and the "first"
> > entry)
> >
>
> We tend to create relationships for all common query or traversal paths,
> with different relationship types in all cases. So traversing the original
> data would use 'next'. Traversing down the index would be 'child', or
> perhaps 'index-child' if there is ambiguity. etc.
>
> -          Anyone used any kind of intermediate index or other approach to
> > bridge multiple Neo instances?
> >
>
> Hmm... I think it was this question that got me started on the combined
> index discussion above, but now that I re-read it, I see it has nothing to
> do with combined indices. I've thought a bit about bridging indices, but
> have nothing really useful to offer here. Sorry. Hope the long discussion
> above still has some value :-(
> _______________________________________________
> Neo mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
>
> _______________________________________________
> Neo mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
>
_______________________________________________
Neo mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user

Re: [Neo] Noob questions/comments

Reply via email to