Re: [Neo] basic questions

Craig Taverner Wed, 17 Mar 2010 15:10:44 -0700

I was reminded recently of this 'context switch' between relational and
graph when I asked a group of developers I was training how to structure the
nodes behind a table of results, where each result was a number representing
the number of other results conforming to some criteria. So basically a
simple count, but we wanted to be able to click on the cell and get the list
of original data. The developers answered that they would store the node-ids
of the original data in an int[] property. That answer totally took me by
surprise, and it took a few seconds for me to realize that it would work for
the initial use case, but was not a graph-like solution, and obviously more
like a relational database foreign key.


Not being a graphy solution was not a good enough reason for me to turn down
the idea, so I had to think a bit further to argue my point. Simply
replacing the ids with relationships to the original nodes (as I originally
planned), was not going to give any further functionality. Instead it would
slow down performance slightly (due to increased structure). Then it hit me,
the power of the graph was that we can continue to add structure, coping
with far more use cases than a simple foreign key. And the next key piece of
structure I immediately thought of was a layer of intermediate nodes
representing an intermediate level of statistics. Suddenly we could add any
number of levels of detail to the chain, arbitrarily, without affecting the
user interface or the initial use case. That was certainly not possible with
the foreign key approach.

But I also thanked the developer for the suggestion, for without it I would
not have been forced to find a better argument for the graphy solution, and
thereby solved a few other requirements :-)

On Wed, Mar 17, 2010 at 10:53 PM, Tim Langley <timlang...@me.com> wrote:

> Good point :)
> In training our new devs (who typically have a RDBC background) it
> often takes a leap of understanding for them to realise that whilst a
> join is very expensive, creating and traversing a relationship is very
> cheap
>
> T
>
> Sent from my iPhone so please excuse typos adn brevity
>
> Tim Langley
> +44 7989 539363
>
> On 17 Mar 2010, at 21:50, Lincoln <linxbet...@gmail.com> wrote:
>
> > Ok I get it.  I keep not thinking about relationships flexibly
> > enough.  I
> > need to take a little time to get used to thinking about problems
> > this way.
> > Thanks for your help!
> >
> > On Wed, Mar 17, 2010 at 5:43 PM, Craig Taverner <cr...@amanzi.com>
> > wrote:
> >
> >> I guess I could say that the approach is totally different, but in
> >> reality
> >> you start at the same point, working out what query you want. But
> >> after
> >> that
> >> things change. Let's consider two cases, the huge network and the
> >> paging
> >> examples you used below.
> >>
> >> For the first, keep in mind that if you scale the total database
> >> size,
> >> relational db's don't scale as well as graph db's, because in
> >> relational
> >> db's performance is related to total table size, whereas graph db's
> >> performance is not. However, both db's are affected by the result
> >> set size.
> >> So for twitter type cases, graph db's scale better with total
> >> subscriber
> >> base, but not necessarily for an individual 'mega-user'. But let's
> >> look at
> >> that case. I believe a real human user would never be able to process
> >> millions of messages a day, so I must assume that this is a bot use
> >> case,
> >> perhaps a script that subscribes to enormous numbers of people
> >> scanning for
> >> particular patterns of information. Both relational and graph db's
> >> will
> >> feel
> >> the strain on a huge resultset. Personally I would probably follow
> >> twitters
> >> example and return only paginated result, breaking up the load so
> >> the bot
> >> would not negatively impact the network for human users.
> >>
> >> So, getting to pagination, the query you want is perhaps 'give me the
> >> ordered results, as a block of n messages starting at message x'.
> >> Obviously
> >> this is easily translatable into SQL for relational databases. For
> >> the
> >> graph
> >> database, it is a little more work, but again you start with that
> >> verbal
> >> query and figure out what it really means in a graph. There are two
> >> key
> >> components:
> >>
> >>  - Ordered results
> >>  - Starting point and number of results
> >>
> >> For SQL this translates to 'order by', 'where id > x' and 'limit
> >> n'. For
> >> graph databases this translates into adding ordering relationships
> >> like
> >> message(X)-->PREVIOUS-->message(X+1). On each pagination, you can
> >> take the
> >> PREVIOUS message as the begining of the next pagination (possibly
> >> passing
> >> the node id in the session). Each pagination is simply a traverser
> >> that
> >> runs
> >> down the PREVIOUS chain, exiting the loop after n messages.
> >>
> >>
> >> On Wed, Mar 17, 2010 at 10:19 PM, Lincoln <linxbet...@gmail.com>
> >> wrote:
> >>
> >>> Wow dude, this is blowing my mind just a little.
> >>>
> >>> Ok, sticking with the twitter example, I'm concerned about the edge
> >> cases.
> >>> I'd say it's easy to optimize with a relational db or any other
> >>> storage
> >>> for
> >>> that matter if I make the assumption that people only follow a few
> >> hundred
> >>> people and only want recent messages.  However some people follow
> >> hundreds
> >>> of thousands of people.  If Guy Kawasaki uses my app, I'd run into a
> >>> problem
> >>> quickly.
> >>>
> >>> However I see your point that I don't have to limit myself to just
> >>> the
> >>> obvious relationships, but can create relationships that serve
> >>> specific
> >>> purposes and use-cases such as your day example.  I'm not sure how I
> >> would
> >>> want to model my use-case to allow for Guy Kawaski, I'll have to
> >>> think
> >> more
> >>> about it.  Is there a threshold beyond which adding relationships
> >>> between
> >>> nodes causes problems?  If not, or if it's high, you could create
> >>> custom
> >>> relationships for every type of query you'd want to do.
> >>>
> >>> However, a secondary question comes up.  If we continue with the
> >>> twitter
> >>> example, and I want to be able to page through results, is that
> >>> directly
> >>> supported through Neo4j's API?  Coming from a more traditional
> >>> storage
> >>> background I tend to think of what I'd want as a sort by time and
> >>> then a
> >>> skip and limit on the results (so I could say give me messages 1-100
> >> sorted
> >>> by time descending).  Is there anything equivalent in Neo4j or is
> >>> the
> >>> approach totally different?
> >>>
> >>> Thanks,
> >>> Lincoln
> >>>
> >>>
> >>> On Wed, Mar 17, 2010 at 12:41 PM, Craig Taverner <cr...@amanzi.com>
> >> wrote:
> >>>
> >>>> Hi Lincoln,
> >>>>
> >>>> So it sounds like you don't need the IS_VISIBLE relations after
> >>>> all.
> >> The
> >>>> traverser works by following all relationships of the specified
> >>>> types
> >> and
> >>>> directions from each current node (as you traverse, or walk the
> >>>> graph).
> >>> You
> >>>> can have a complex graph and traverse to high depth very fast
> >> (thousands
> >>> of
> >>>> relationships per second). The traverser will also automatically
> >>>> check
> >>> that
> >>>> the same node is not returned twice. The test for the
> >>>> relationship type
> >>> is
> >>>> efficient. Still reasonable, but less efficient is the custom
> >>>> test you
> >>>> might
> >>>> put in the returnable evaluator, but if the limiting factor is
> >>>> usually
> >>> the
> >>>> number of relationships traversed, and if that is kept managable,
> >>>> the
> >>>> evaluator test is no concern.
> >>>>
> >>>> I think twitter is a good case in point, even with many millions of
> >>> users,
> >>>> you will still only follow perhaps a hundred and they will tweet
> >> perhaps
> >>> a
> >>>> hundred, or a thousand times, so your traverser will find the
> >>>> 10k-100k
> >>>> messages quite quickly. This can be speeded up further, but the
> >>>> right
> >>>> approach depends again on your use case. The idea with using a
> >>>> graph
> >>>> database is that the actual usage probably maps very well to the
> >>>> graph
> >>>> structure, so when deciding how to speed up your search, consider
> >>>> how
> >> it
> >>>> will be used. In twitter one normally only cares about recent
> >>>> messages,
> >>> so
> >>>> how about not linking directly from the user to the message, but
> >>>> link
> >> to
> >>> an
> >>>> intermediate node representing time, for example, a day-node.
> >>>> Then each
> >>> new
> >>>> message is added to the day node for that day, and that will
> >>> automatically
> >>>> become yesterday the next day. Then your traversal can have a stop
> >>>> evaluator
> >>>> to not follow old messages (unless your query is looking for old
> >>> messages,
> >>>> of course). So the 100k messages might drop to only a few
> >>>> hundred, or
> >>> even
> >>>> just a few dozen. Certainly that will be a query of the order of
> >>>> milliseconds!
> >>>>
> >>>> Moving away from the traverser, you also have the option to call
> >> directly
> >>>> the getRelationships() methods from the node. If you structure is
> >>>> predictable, like viewer-->FOLLOWS-->user-->CREATED-->message,
> >>>> then two
> >>>> nested for loops would work, the outer iterating over the
> >>>> followers and
> >>> the
> >>>> inner iterating over the messages. If you changed to add a time-
> >>>> based
> >>>> interim node (which is a kind of graph-index), then you need to
> >>>> have
> >>> three
> >>>> loops. If you made your time index a deeper tree (months->days-
> >>>> >hours,
> >>>> etc.), then you would need to further refactor the code. However,
> >>>> if
> >> you
> >>>> stuck with a traverser, you might not need to change the
> >>>> traverser even
> >>> of
> >>>> the graph structure changed, as long as the same relationship types
> >> were
> >>>> maintained. Does that make sense?
> >>>>
> >>>> Cheers, Craig
> >>>>
> >>>> On Wed, Mar 17, 2010 at 4:00 PM, Lincoln <linxbet...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Thanks Craig,
> >>>>>
> >>>>> I'd like to clarify my question (I don't think it changes your
> >>>>> answer
> >>>>> though).
> >>>>>
> >>>>> I wanted all messages visible to me created by users I follow.
> >>>>> Thus,
> >>> the
> >>>>> FOLLOWS relationship is not enough.  I'd need to see messages that
> >> are
> >>>>> visible to me and then check if they were created by users I
> >>>>> follow,
> >> or
> >>>> I'd
> >>>>> need to see messages created by users I follow and then see if
> >> they're
> >>>>> visible to me.
> >>>>>
> >>>>> I assume your last example still yields the result I'm looking
> >>>>> for.
> >>>> Could
> >>>>> you describe what actually happens here though?  I'm unclear on
> >>>>> what
> >>> the
> >>>>> traversal looks like.  Would it first traverse every outgoing
> >>>>> FOLLOWS
> >>>>> relationship from the viewer, yielding other users, and then
> >>>>> traverse
> >>> all
> >>>>> the CREATED relationships to get to messages?
> >>>>>
> >>>>> Also, given very large numbers of FOLLOWS and CREATED
> >>>>> relationships
> >>> (with
> >>>>> say, a twitter graph), how is this made efficient?
> >>>>>
> >>>>> Sorry for all the basic questions but I couldn't find this
> >> information
> >>> in
> >>>>> the docs.  If there's something I should be reading before posting
> >>> these
> >>>>> questions, please point me to it.
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> Lincoln
> >>>>>
> >>>>> On Wed, Mar 17, 2010 at 7:06 AM, Craig Taverner <cr...@amanzi.com>
> >>>> wrote:
> >>>>>
> >>>>>> I'm uncertain about one ambiguity in your model, you are able to
> >> find
> >>>>>> messages through FOLLOWS and IS_VISIBLE_BY. These will give two
> >>>> different
> >>>>>> sets, and my first impression was that FOLLOWS gives you the
> >>>>>> right
> >>>>> answer.
> >>>>>> In other words you want to query for 'all messages by users I
> >>> follow'?
> >>>> In
> >>>>>> that case you do not need IS_VISIBLE_BY. However, if there are
> >>> messages
> >>>>> by
> >>>>>> people you follow, but are not allowed to see, then you also need
> >> the
> >>>>>> IS_VISIBLE_BY. But I would still reconsider linking directly from
> >> the
> >>>>>> viewer
> >>>>>> to the message for that case. I'd rather have the messages linked
> >> to
> >>>> some
> >>>>>> categorization structure for things like 'public', 'private',
> >>>>>> etc.
> >>>>>>
> >>>>>> Anyway, here are some suggestions for the various approaches
> >>>>>> above:
> >>>>>> *'all messages by users I follow'*
> >>>>>> val msgs = viewer.traverse(
> >>>>>> Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH,
> >>>>>> (tp: TraversalPosition) => IsMessage(tp.currentNode()),
> >>>>>> Rels.FOLLOWS, Direction.OUTGOING,
> >>>>>> Rels.CREATED, Direction.OUTGOING)
> >>>>>>
> >>>>>> *'all messages visible to me'*
> >>>>>> val msgs = viewer.traverse(
> >>>>>> Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH,
> >>>>>> ReturnableEvaluator.ALL_BUT_START_NODE,
> >>>>>> Rels.IS_VISIBLE_BY, Direction.INCOMING)
> >>>>>>
> >>>>>> *'all messages, visible to me, by people I follow'*
> >>>>>> val msgs = viewer.traverse(
> >>>>>> Order.BREADTH_FIRST, StopEvaluator.END_OF_GRAPH,
> >>>>>> (tp: TraversalPosition) => {
> >>>>>> val msg = tp.currentNode()
> >>>>>> IsMessage(msg) && IsVisibleBy(msg,viewer)
> >>>>>> },
> >>>>>> Rels.FOLLOWS, Direction.OUTGOING,
> >>>>>> Rels.CREATED, Direction.OUTGOING)
> >>>>>>
> >>>>>> Of course I assume you make the utility functions IsMessage(node:
> >>> Node)
> >>>>> and
> >>>>>> IsVisibleBy(msg: Node, user: Node), and these will test the
> >> existance
> >>>> of
> >>>>>> properties and relations as appropriate to make the decision.
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Mar 17, 2010 at 6:32 AM, Lincoln <linxbet...@gmail.com>
> >>> wrote:
> >>>>>>
> >>>>>>> Hi, I've just started looking at Neo4j and I'm quite intrigued.
> >>>>> However,
> >>>>>>> the cognitive dissonance that I've grown so used to in modeling
> >>>> storage
> >>>>>> is
> >>>>>>> proving to be a bit difficult to let go at this early stage :)
> >>>>>>>
> >>>>>>> I was hoping that if someone could help me through an example I
> >>> would
> >>>>> be
> >>>>>>> able to grok how to properly structure my data and query it in
> >>> Neo4j.
> >>>>>>>
> >>>>>>> Nodes:
> >>>>>>> Message( text: String )
> >>>>>>> User( id: Long )
> >>>>>>>
> >>>>>>> Relationships:
> >>>>>>> CREATED
> >>>>>>> FOLLOWS
> >>>>>>> IS_VISIBLE_BY
> >>>>>>>
> >>>>>>> So I might have a graph with entries like so:
> >>>>>>>
> >>>>>>> User(1) --> CREATED --> Message("i woke up late today")
> >>>>>>> User(2) --> CREATED --> Message("hello")
> >>>>>>> User(3) --> CREATED --> Message("ugh, i hate mondays")
> >>>>>>>
> >>>>>>> User(1) --> FOLLOWS --> User(2)
> >>>>>>>
> >>>>>>> Let's also say all messages are visible to User 1.
> >>>>>>>
> >>>>>>> Message("i woke up late today") --> IS_VISIBLE_BY --> User(1)
> >>>>>>> Message("hello") --> IS_VISIBLE_BY --> User(1)
> >>>>>>> Message("ugh, i hate mondays") --> IS_VISIBLE_BY --> User(1)
> >>>>>>>
> >>>>>>> So, I can do a simple traversal for visible:
> >>>>>>>
> >>>>>>> val graphDb = new EmbeddedGraphDatabase( "path/to/neo4j-db" )
> >>>>>>> val index = new LuceneIndexService( graphDb )
> >>>>>>> val viewer = index.getSingleNode("id", 1)
> >>>>>>> val msgs = viewer.traverse( Order.BREADTH_FIRST,
> >>>>>>> StopEvaluator.END_OF_GRAPH,
> >>>>>>> ReturnableEvaluator.ALL_BUT_START_NODE, Rels.IS_VISIBLE_BY,
> >>>>>>> Direction.INCOMING)
> >>>>>>> msgs.toList.map(_.toJson).mkString("{ msgs : [", ",", "] }")  //
> >>>>> assuming
> >>>>>> i
> >>>>>>> have the relevant functions
> >>>>>>>
> >>>>>>> But let's say that this is going to return too many messages.
> >> Just
> >>>>>> because
> >>>>>>> all the messages are possibly visible to me, doesn't mean I want
> >> to
> >>>> see
> >>>>>>> them
> >>>>>>> all.  So, I'd like to additionally filter by the FOLLOWS
> >>>> relationship.
> >>>>>> I'd
> >>>>>>> like to express "get all messages that are visible and were
> >> created
> >>>> by
> >>>>> a
> >>>>>>> user that I follow."  Can someone show me an example of how to
> >>>>>>> do
> >>>> that?
> >>>>>>>
> >>>>>>> I'm guessing that you need to implement a custom
> >>> ReturnableEvaluator,
> >>>>> but
> >>>>>> I
> >>>>>>> don't understand how you traverse multiple relationships at the
> >>> same
> >>>>>> time.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Lincoln
> >>>>>>> _______________________________________________
> >>>>>>> Neo mailing list
> >>>>>>> User@lists.neo4j.org
> >>>>>>> https://lists.neo4j.org/mailman/listinfo/user
> >>>>>>>
> >>>>>> _______________________________________________
> >>>>>> Neo mailing list
> >>>>>> User@lists.neo4j.org
> >>>>>> https://lists.neo4j.org/mailman/listinfo/user
> >>>>>>
> >>>>> _______________________________________________
> >>>>> Neo mailing list
> >>>>> User@lists.neo4j.org
> >>>>> https://lists.neo4j.org/mailman/listinfo/user
> >>>>>
> >>>> _______________________________________________
> >>>> Neo mailing list
> >>>> User@lists.neo4j.org
> >>>> https://lists.neo4j.org/mailman/listinfo/user
> >>>>
> >>> _______________________________________________
> >>> Neo mailing list
> >>> User@lists.neo4j.org
> >>> https://lists.neo4j.org/mailman/listinfo/user
> >>>
> >> _______________________________________________
> >> Neo mailing list
> >> User@lists.neo4j.org
> >> https://lists.neo4j.org/mailman/listinfo/user
> >>
> > _______________________________________________
> > Neo mailing list
> > User@lists.neo4j.org
> > https://lists.neo4j.org/mailman/listinfo/user
> _______________________________________________
> Neo mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
>
_______________________________________________
Neo mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user

Re: [Neo] basic questions

Reply via email to