Re: [Neo] Traversal Speed is just 1 millisecond per node
Hi, I understand the graph layout as this: (publisher node)<--PUBLISHED_BY---(book)<--BORROWED_BY--(student) There are other relationships between books and students (RETURNED_BY,RESERVED_BY) but the relationship count on a book node will still be low compared to the publisher node. Correct? On a machine with 8GB RAM and that graph calculating "amount of borrowed books" for a publisher or "get amount of books never borrowed" for a publisher should execute in under 100ms given your traversal speed is 300+ hops/ms (in 30ms with 1000 hops/ms). Normal cached traversal speed on any modern hardware should be 1000+ hops/ms. The poor performance when performing the depth two traversal to get the amount of borrowed relationships on a publisher is indeed very strange. A possible reason could be that the traverser is traversing more than it should. For example if you create a traverser that traverses the PUBLISHED_BY and BORROWED_BY relationships in both directions you would traverse (publisher)<-(book)<-(student)->(other book)->(other or same publisher)<-(other book)... and so on (and that would take time). Also make sure you are not measuring number of "borrowed books traversal speed" over all publishers instead of a single publisher (retrieving borrowed books from all publishers with a low borrowed ratio would result in a low "borrowed books/ms" value). Could you write a simple traverser using the core API that calculates the relationship count at depth two given a publisher node. Something like: Node publisher = // get a publisher node int count = 0; for ( Relationship bookRel : publisher.getRelationships( RelTypes.PUBLISHED_BY ) ) { count++; Node book = bookRel.getOtherNode( publisher ); for ( Relationship studentRel : book.getRelationships( RelTypes.BORROWED_BY ) ) { count++; } } Avg count and time to perform that operation wold be interesting. If timings are not in expected range (basically 15k [avg number of rels/publisher] x 2 [the borrowed_by rel] x traversal speed) try get the total relationship count by replacing getRelationships(type) with getRelationships() to get worst case scenario on how much traversing needs to be done. Regards, -Johan On Sun, May 16, 2010 at 4:08 AM, suryadev vasudev wrote: > Here is a rough design and volume > > ... > > When I first read about traversal speed of 1000-3000/millisecond, I added > some buffer and assumed 500/millisecond as a realistic speed. I am not > giving up so easily after seeing 1/millisecond. I look forward to responses > from other users. > > The real challenges will be around queries for a publisher. A publisher will > have around 15,000 books and a query like "Given a published ID, what > percentage of his books were never borrowed" will need full browsing. My > hope was that I could browse through and get the answer in 30 milliseconds. > But it looks like it will take a minimum of 15 seconds. > > Some publishers will have 50,000 books and I can't imagine a response time > of 50 seconds. > > So, I have to achieve at least 500/millisecond if not the original 1000. > > Regards > SDev ___ Neo mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo] Traversal Speed is just 1 millisecond per node
Here is a rough design and volume NODE:PUBLISHER (1000 publishers) published_id Publisher_name Publisher address Publisher City Publisher State Publisher Country Publisher Primary Email Publisher URL NODE:STUDENT (15,000 students) Student_id Student_first_name Student_last_name Student_registration_date Student_course_completion_date Student_Email_id NODE:BOOK (15 million books) Book_id Book_ISBN Book_name Book_Primary_Author Book_Secondary_Author Book_Published_year Book_subject RELATIONSHIP: PUBLISHED_BY Purchase_date Purchase_approved_by Purchase_contract_number RELATIONSHIP: BORROWED_BY borrowed_date due_date RELATIONSHIP: RETURNED_BY borrowed_date due_date returned_date due_amount_paid RELATIONSHIP: RESERVED_BY reservation_date The BORROWED_BY relationship is maintained for an active borrowing. This relationship is deleted and RETURNED_BY relationship is created when book is returned. So there can be a maximum of one BORROWED_BY relationship for any one book. Off course there will be more than one RETURNED_BY for a book. Many students can reserve the book at any time. All will get a email when a book is returned. The application is expected to provide dashboard services and analytical reports Student dashboard: All books borrowed, returned and reserved by a student for a date range Book dashboard: Lending history of a book for a given date range Publisher dashboard: All books for a particular publisher, lending history Librarian dashboard: Lending activities for a given date range (by publisher, by hour of day etc) How many books were not in the library for a given day Coming from a strong RDBMS background, I had instructed my team to stick to nodes and their natural relationships. Creating a artificial relationship CURRENTLY_BORROWED between publisher and book was not in our mind. When I first read about traversal speed of 1000-3000/millisecond, I added some buffer and assumed 500/millisecond as a realistic speed. I am not giving up so easily after seeing 1/millisecond. I look forward to responses from other users. The real challenges will be around queries for a publisher. A publisher will have around 15,000 books and a query like "Given a published ID, what percentage of his books were never borrowed" will need full browsing. My hope was that I could browse through and get the answer in 30 milliseconds. But it looks like it will take a minimum of 15 seconds. Some publishers will have 50,000 books and I can't imagine a response time of 50 seconds. So, I have to achieve at least 500/millisecond if not the original 1000. Regards SDev On Sat, May 15, 2010 at 4:59 PM, wrote: > Also, can you describe how you are using properties in this > scenario? What types of properties, approximate size of the data, > etc... > > > > Original Message > Subject: [Neo] Traversal Speed is just 1 millisecond per node > From: suryadev vasudev > Date: Sat, May 15, 2010 5:34 pm > To: user@lists.neo4j.org > We are considering Neo4J for a decision making application. The > application > is analogous to a Library having 15 million books. We have BOOKS, > PUBLISHERS > and STUDENTS as nodes. Every book will have a PUBLISHED_BY relationship > to > one publisher. STUDENTS may borrow a book, reserve a book or return a > borrowed book. Each is a relationship type meaning BORROWED_BY, > RESERVED_BY > and RETURNS between BOOKS and STUDENTS. > When we traverse starting from a publisher, the traversing speed is > 200-1000 > nodes per millisecond. This is pure traversal to get a book count by > publisher. > The Neo is failing us when we make a slightly complex query. > Starting with a publisher, retrieve all books that are currently lent > out. > Starting with a publisher, retrieve all books that were borrowed > between May > 1 2010 and May 10 2010. > The response time we got was 1-2 millisecond per book. > Before running the test, we created between 0-3 relationships for each > book. > We have seeded 15,000 students ,1000 publishers and 15 million books. > And the server is a 8GB RAM machine. > I wonder why the traversal is drastically slow? > Regards > SDev > ___ > Neo mailing list > User@lists.neo4j.org > [1]https://lists.neo4j.org/mailman/listinfo/user > > References > > 1. https://lists.neo4j.org/mailman/listinfo/user > ___ > Neo mailing list > User@lists.neo4j.org > https://lists.neo4j.org/mailman/listinfo/user > ___ Neo mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo] Traversal Speed is just 1 millisecond per node
Also, can you describe how you are using properties in this scenario? What types of properties, approximate size of the data, etc... Original Message Subject: [Neo] Traversal Speed is just 1 millisecond per node From: suryadev vasudev Date: Sat, May 15, 2010 5:34 pm To: user@lists.neo4j.org We are considering Neo4J for a decision making application. The application is analogous to a Library having 15 million books. We have BOOKS, PUBLISHERS and STUDENTS as nodes. Every book will have a PUBLISHED_BY relationship to one publisher. STUDENTS may borrow a book, reserve a book or return a borrowed book. Each is a relationship type meaning BORROWED_BY, RESERVED_BY and RETURNS between BOOKS and STUDENTS. When we traverse starting from a publisher, the traversing speed is 200-1000 nodes per millisecond. This is pure traversal to get a book count by publisher. The Neo is failing us when we make a slightly complex query. Starting with a publisher, retrieve all books that are currently lent out. Starting with a publisher, retrieve all books that were borrowed between May 1 2010 and May 10 2010. The response time we got was 1-2 millisecond per book. Before running the test, we created between 0-3 relationships for each book. We have seeded 15,000 students ,1000 publishers and 15 million books. And the server is a 8GB RAM machine. I wonder why the traversal is drastically slow? Regards SDev ___ Neo mailing list User@lists.neo4j.org [1]https://lists.neo4j.org/mailman/listinfo/user References 1. https://lists.neo4j.org/mailman/listinfo/user ___ Neo mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo] Traversal Speed is just 1 millisecond per node
Don't forget that there is a DRAMATIC difference between "warm" benchmark results and "cold" results. If you can do a few extensive queries to pre-load the nodes/relationships, the results should be much better. Also, it would be useful to look at your code, as I suspect there is something in there that is causing the three order of magnitude reduction in performance. Original Message Subject: [Neo] Traversal Speed is just 1 millisecond per node From: suryadev vasudev Date: Sat, May 15, 2010 5:34 pm To: user@lists.neo4j.org We are considering Neo4J for a decision making application. The application is analogous to a Library having 15 million books. We have BOOKS, PUBLISHERS and STUDENTS as nodes. Every book will have a PUBLISHED_BY relationship to one publisher. STUDENTS may borrow a book, reserve a book or return a borrowed book. Each is a relationship type meaning BORROWED_BY, RESERVED_BY and RETURNS between BOOKS and STUDENTS. When we traverse starting from a publisher, the traversing speed is 200-1000 nodes per millisecond. This is pure traversal to get a book count by publisher. The Neo is failing us when we make a slightly complex query. Starting with a publisher, retrieve all books that are currently lent out. Starting with a publisher, retrieve all books that were borrowed between May 1 2010 and May 10 2010. The response time we got was 1-2 millisecond per book. Before running the test, we created between 0-3 relationships for each book. We have seeded 15,000 students ,1000 publishers and 15 million books. And the server is a 8GB RAM machine. I wonder why the traversal is drastically slow? Regards SDev ___ Neo mailing list User@lists.neo4j.org [1]https://lists.neo4j.org/mailman/listinfo/user References 1. https://lists.neo4j.org/mailman/listinfo/user ___ Neo mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo] Traversal Speed is just 1 millisecond per node
Hi, Adding onto Craig's thoughts, I'd like to point you to some related work in this area: 1. Modeling a library as a graph. - slides: http://www.slideshare.net/slidarko/a-practical-ontology-for-the-largescale-modeling-of-scholarly-artifacts-and-their-usage-3879791 - article: http://arxiv.org/abs/0708.1150 2. Doing 'slightly complex queries' as graph traversal over graph databases such as Neo4j: - software framework: http://pipes.tinkerpop.com - pipes give you fine-grained control over your walker with good speed: http://bit.ly/aa29MO - related article: http://arxiv.org/abs/0806.2274 - related article: http://arxiv.org/abs/1004.1001 Take care, Marko. http://tinkerpop.com http://markorodriguez.com On May 15, 2010, at 4:05 PM, Craig Taverner wrote: > My 2 cents, without knowing the structure of your data (which is needed to > really answer the question). > > I assume when you say 'slightly complex query' you are probably using a > custom traverser that looks at properties of nodes and/or relationships to > make the decision, or possibly even follows a relationship to make the > decision. All of these options will slow things down. Your original > traverser probably only considered relationship types and directions, > loading from only the relationships table. The new one hits the properties > tables, possibly for both nodes and relationships. > > If this is the case, the improvement is much the same as you would do in a > relational database, which is to index the data. However, indexing is > different in a graph, and I think the best way to do that in your case is to > build additional graph structures that allow the new traverser to only look > at relationships. For example, you say that you are interested in books from > a particular published currently lent out. Consider having the publisher not > have direct relationships to their books (a publisher index), but instead > have relationships to 'borrowed' and 'not borrowed' nodes and those are > related to the books (effectively a combined publisher-borrowing_status > index). When a book is borrowed, move it's relationship. Since borrowing a > book occurs occasionally over very long times (days or weeks), this database > edit has no performance cost, but makes the query you are looking for very > fast. To add a time period to this situation, consider the TimeLineIndex. > Alternatively extend the previous concept to have nodes representing books > borrowed on certain days, for example. > > The real solution is really dependent on your data and the kinds of queries > you plan to make. You probably already made the publisher-book relationships > because you planned to make a query like that. The more complex queries you > wish to make the more complex structure you will probably devise. Neo4j is > great in that you can keep optimizing by adding appropriate structure > without removing previous capabilities. > > On Sat, May 15, 2010 at 11:34 PM, suryadev vasudev < > suryadev.vasu...@gmail.com> wrote: > >> We are considering Neo4J for a decision making application. The application >> is analogous to a Library having 15 million books. We have BOOKS, >> PUBLISHERS >> and STUDENTS as nodes. Every book will have a PUBLISHED_BY relationship to >> one publisher. STUDENTS may borrow a book, reserve a book or return a >> borrowed book. Each is a relationship type meaning BORROWED_BY, RESERVED_BY >> and RETURNS between BOOKS and STUDENTS. >> When we traverse starting from a publisher, the traversing speed is >> 200-1000 >> nodes per millisecond. This is pure traversal to get a book count by >> publisher. >> The Neo is failing us when we make a slightly complex query. >> Starting with a publisher, retrieve all books that are currently lent out. >> Starting with a publisher, retrieve all books that were borrowed between >> May >> 1 2010 and May 10 2010. >> The response time we got was 1-2 millisecond per book. >> Before running the test, we created between 0-3 relationships for each >> book. >> We have seeded 15,000 students ,1000 publishers and 15 million books. >> And the server is a 8GB RAM machine. >> I wonder why the traversal is drastically slow? >> Regards >> SDev >> ___ >> Neo mailing list >> User@lists.neo4j.org >> https://lists.neo4j.org/mailman/listinfo/user >> > ___ > Neo mailing list > User@lists.neo4j.org > https://lists.neo4j.org/mailman/listinfo/user ___ Neo mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo] Traversal Speed is just 1 millisecond per node
My 2 cents, without knowing the structure of your data (which is needed to really answer the question). I assume when you say 'slightly complex query' you are probably using a custom traverser that looks at properties of nodes and/or relationships to make the decision, or possibly even follows a relationship to make the decision. All of these options will slow things down. Your original traverser probably only considered relationship types and directions, loading from only the relationships table. The new one hits the properties tables, possibly for both nodes and relationships. If this is the case, the improvement is much the same as you would do in a relational database, which is to index the data. However, indexing is different in a graph, and I think the best way to do that in your case is to build additional graph structures that allow the new traverser to only look at relationships. For example, you say that you are interested in books from a particular published currently lent out. Consider having the publisher not have direct relationships to their books (a publisher index), but instead have relationships to 'borrowed' and 'not borrowed' nodes and those are related to the books (effectively a combined publisher-borrowing_status index). When a book is borrowed, move it's relationship. Since borrowing a book occurs occasionally over very long times (days or weeks), this database edit has no performance cost, but makes the query you are looking for very fast. To add a time period to this situation, consider the TimeLineIndex. Alternatively extend the previous concept to have nodes representing books borrowed on certain days, for example. The real solution is really dependent on your data and the kinds of queries you plan to make. You probably already made the publisher-book relationships because you planned to make a query like that. The more complex queries you wish to make the more complex structure you will probably devise. Neo4j is great in that you can keep optimizing by adding appropriate structure without removing previous capabilities. On Sat, May 15, 2010 at 11:34 PM, suryadev vasudev < suryadev.vasu...@gmail.com> wrote: > We are considering Neo4J for a decision making application. The application > is analogous to a Library having 15 million books. We have BOOKS, > PUBLISHERS > and STUDENTS as nodes. Every book will have a PUBLISHED_BY relationship to > one publisher. STUDENTS may borrow a book, reserve a book or return a > borrowed book. Each is a relationship type meaning BORROWED_BY, RESERVED_BY > and RETURNS between BOOKS and STUDENTS. > When we traverse starting from a publisher, the traversing speed is > 200-1000 > nodes per millisecond. This is pure traversal to get a book count by > publisher. > The Neo is failing us when we make a slightly complex query. > Starting with a publisher, retrieve all books that are currently lent out. > Starting with a publisher, retrieve all books that were borrowed between > May > 1 2010 and May 10 2010. > The response time we got was 1-2 millisecond per book. > Before running the test, we created between 0-3 relationships for each > book. > We have seeded 15,000 students ,1000 publishers and 15 million books. > And the server is a 8GB RAM machine. > I wonder why the traversal is drastically slow? > Regards > SDev > ___ > Neo mailing list > User@lists.neo4j.org > https://lists.neo4j.org/mailman/listinfo/user > ___ Neo mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user