Re: [Neo] Traversal Speed is just 1 millisecond per node

2010-05-17 Thread Johan Svensson
Hi,

I understand the graph layout as this:

(publisher node)<--PUBLISHED_BY---(book)<--BORROWED_BY--(student)

There are other relationships between books and students
(RETURNED_BY,RESERVED_BY) but the relationship count on a book node
will still be low compared to the publisher node. Correct?

On a machine with 8GB RAM and that graph calculating "amount of
borrowed books" for a publisher or "get amount of books never
borrowed" for a publisher should execute in under 100ms given your
traversal speed is 300+ hops/ms (in 30ms with 1000 hops/ms). Normal
cached traversal speed on any modern hardware should be 1000+ hops/ms.

The poor performance when performing the depth two traversal to get
the amount of borrowed relationships on a publisher is indeed very
strange.

A possible reason could be that the traverser is traversing more than
it should. For example if you create a traverser that traverses the
PUBLISHED_BY and BORROWED_BY relationships in both directions you
would traverse (publisher)<-(book)<-(student)->(other book)->(other or
same publisher)<-(other book)... and so on (and that would take time).

Also make sure you are not measuring number of "borrowed books
traversal speed" over all publishers instead of a single publisher
(retrieving borrowed books from all publishers with a low borrowed
ratio would result in a low "borrowed books/ms" value).

Could you write a simple traverser using the core API that calculates
the relationship count at depth two given a publisher node. Something
like:

Node publisher = // get a publisher node
int count = 0;
for ( Relationship bookRel : publisher.getRelationships(
RelTypes.PUBLISHED_BY ) )
{
count++;
Node book = bookRel.getOtherNode( publisher );
for ( Relationship studentRel : book.getRelationships(
RelTypes.BORROWED_BY ) )
{
count++;
}
}

Avg count and time to perform that operation wold be interesting. If
timings are not in expected range (basically 15k [avg number of
rels/publisher] x 2 [the borrowed_by rel] x traversal speed) try get
the total relationship count by replacing getRelationships(type) with
getRelationships() to get worst case scenario on how much traversing
needs to be done.

Regards,
-Johan

On Sun, May 16, 2010 at 4:08 AM, suryadev vasudev
 wrote:
> Here is a rough design and volume
>
> ...
>
> When I first read about traversal speed of 1000-3000/millisecond,  I added
> some buffer and assumed 500/millisecond as a realistic speed. I am not
> giving up so easily after seeing 1/millisecond. I look forward to responses
> from other users.
>
> The real challenges will be around queries for a publisher. A publisher will
> have around 15,000 books and a query like "Given a published ID, what
> percentage of his books were never borrowed"  will need full browsing. My
> hope was that I could browse through and get the answer in 30 milliseconds.
> But it looks like it will take a minimum of 15 seconds.
>
> Some publishers will have 50,000 books and I can't imagine a response time
> of 50 seconds.
>
> So, I have to achieve at least 500/millisecond if not the original 1000.
>
> Regards
> SDev
___
Neo mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo] Traversal Speed is just 1 millisecond per node

2010-05-15 Thread suryadev vasudev
Here is a rough design and volume

NODE:PUBLISHER (1000 publishers)
published_id
Publisher_name
Publisher address
Publisher City
Publisher State
Publisher Country
Publisher Primary Email
Publisher URL

NODE:STUDENT (15,000 students)
Student_id
Student_first_name
Student_last_name
Student_registration_date
Student_course_completion_date
Student_Email_id

NODE:BOOK (15 million books)
Book_id
Book_ISBN
Book_name
Book_Primary_Author
Book_Secondary_Author
Book_Published_year
Book_subject

RELATIONSHIP: PUBLISHED_BY
Purchase_date
Purchase_approved_by
Purchase_contract_number

RELATIONSHIP: BORROWED_BY
borrowed_date
due_date

RELATIONSHIP: RETURNED_BY
borrowed_date
due_date
returned_date
due_amount_paid

RELATIONSHIP: RESERVED_BY
reservation_date

The BORROWED_BY relationship is maintained for an active borrowing. This
relationship is deleted and RETURNED_BY relationship is created when book is
returned. So there can be a maximum of one BORROWED_BY relationship for any
one book. Off course there will be more than one RETURNED_BY for a book.

Many students can reserve the book at any time. All will get a email when a
book is returned.

The application is expected to provide dashboard services and analytical
reports

Student dashboard:
All books borrowed, returned and reserved by a student for a date range
Book dashboard:
Lending history of a book for a given date range
Publisher dashboard:
All books for a particular publisher, lending history
Librarian dashboard:
Lending activities for a given date range (by publisher, by hour of day etc)
How many books were not in the library for a given day


Coming from a strong RDBMS background, I had instructed my team to stick to
nodes and their natural relationships. Creating a artificial relationship
CURRENTLY_BORROWED between publisher and book was not in our mind.

When I first read about traversal speed of 1000-3000/millisecond,  I added
some buffer and assumed 500/millisecond as a realistic speed. I am not
giving up so easily after seeing 1/millisecond. I look forward to responses
from other users.

The real challenges will be around queries for a publisher. A publisher will
have around 15,000 books and a query like "Given a published ID, what
percentage of his books were never borrowed"  will need full browsing. My
hope was that I could browse through and get the answer in 30 milliseconds.
But it looks like it will take a minimum of 15 seconds.

Some publishers will have 50,000 books and I can't imagine a response time
of 50 seconds.

So, I have to achieve at least 500/millisecond if not the original 1000.

Regards
SDev

On Sat, May 15, 2010 at 4:59 PM, wrote:

>   Also, can you describe how you are using properties in this
>   scenario?  What types of properties, approximate size of the data,
>   etc...
>
>
>
>    Original Message 
>   Subject: [Neo] Traversal Speed is just 1 millisecond per node
>   From: suryadev vasudev 
>   Date: Sat, May 15, 2010 5:34 pm
>   To: user@lists.neo4j.org
>   We are considering Neo4J for a decision making application. The
>   application
>   is analogous to a Library having 15 million books. We have BOOKS,
>   PUBLISHERS
>   and STUDENTS as nodes. Every book will have a PUBLISHED_BY relationship
>   to
>   one publisher. STUDENTS may borrow a book, reserve a book or return a
>   borrowed book. Each is a relationship type meaning BORROWED_BY,
>   RESERVED_BY
>   and RETURNS between BOOKS and STUDENTS.
>   When we traverse starting from a publisher, the traversing speed is
>   200-1000
>   nodes per millisecond. This is pure traversal to get a book count by
>   publisher.
>   The Neo is failing us when we make a slightly complex query.
>   Starting with a publisher, retrieve all books that are currently lent
>   out.
>   Starting with a publisher, retrieve all books that were borrowed
>   between May
>   1 2010 and May 10 2010.
>   The response time we got was 1-2 millisecond per book.
>   Before running the test, we created between 0-3 relationships for each
>   book.
>   We have seeded 15,000 students ,1000 publishers and 15 million books.
>   And the server is a 8GB RAM machine.
>   I wonder why the traversal is drastically slow?
>   Regards
>   SDev
>   ___
>   Neo mailing list
>   User@lists.neo4j.org
>   [1]https://lists.neo4j.org/mailman/listinfo/user
>
> References
>
>   1. https://lists.neo4j.org/mailman/listinfo/user
> ___
> Neo mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
>
___
Neo mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo] Traversal Speed is just 1 millisecond per node

2010-05-15 Thread rick . bullotta
   Also, can you describe how you are using properties in this
   scenario?  What types of properties, approximate size of the data,
   etc...



    Original Message 
   Subject: [Neo] Traversal Speed is just 1 millisecond per node
   From: suryadev vasudev 
   Date: Sat, May 15, 2010 5:34 pm
   To: user@lists.neo4j.org
   We are considering Neo4J for a decision making application. The
   application
   is analogous to a Library having 15 million books. We have BOOKS,
   PUBLISHERS
   and STUDENTS as nodes. Every book will have a PUBLISHED_BY relationship
   to
   one publisher. STUDENTS may borrow a book, reserve a book or return a
   borrowed book. Each is a relationship type meaning BORROWED_BY,
   RESERVED_BY
   and RETURNS between BOOKS and STUDENTS.
   When we traverse starting from a publisher, the traversing speed is
   200-1000
   nodes per millisecond. This is pure traversal to get a book count by
   publisher.
   The Neo is failing us when we make a slightly complex query.
   Starting with a publisher, retrieve all books that are currently lent
   out.
   Starting with a publisher, retrieve all books that were borrowed
   between May
   1 2010 and May 10 2010.
   The response time we got was 1-2 millisecond per book.
   Before running the test, we created between 0-3 relationships for each
   book.
   We have seeded 15,000 students ,1000 publishers and 15 million books.
   And the server is a 8GB RAM machine.
   I wonder why the traversal is drastically slow?
   Regards
   SDev
   ___
   Neo mailing list
   User@lists.neo4j.org
   [1]https://lists.neo4j.org/mailman/listinfo/user

References

   1. https://lists.neo4j.org/mailman/listinfo/user
___
Neo mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo] Traversal Speed is just 1 millisecond per node

2010-05-15 Thread rick . bullotta
   Don't forget that there is a DRAMATIC difference between "warm"
   benchmark results and "cold" results.  If you can do a few extensive
   queries to pre-load the nodes/relationships, the results should be much
   better.



   Also, it would be useful to look at your code, as I suspect there is
   something in there that is causing the three order of magnitude
   reduction in performance.







    Original Message 
   Subject: [Neo] Traversal Speed is just 1 millisecond per node
   From: suryadev vasudev 
   Date: Sat, May 15, 2010 5:34 pm
   To: user@lists.neo4j.org
   We are considering Neo4J for a decision making application. The
   application
   is analogous to a Library having 15 million books. We have BOOKS,
   PUBLISHERS
   and STUDENTS as nodes. Every book will have a PUBLISHED_BY relationship
   to
   one publisher. STUDENTS may borrow a book, reserve a book or return a
   borrowed book. Each is a relationship type meaning BORROWED_BY,
   RESERVED_BY
   and RETURNS between BOOKS and STUDENTS.
   When we traverse starting from a publisher, the traversing speed is
   200-1000
   nodes per millisecond. This is pure traversal to get a book count by
   publisher.
   The Neo is failing us when we make a slightly complex query.
   Starting with a publisher, retrieve all books that are currently lent
   out.
   Starting with a publisher, retrieve all books that were borrowed
   between May
   1 2010 and May 10 2010.
   The response time we got was 1-2 millisecond per book.
   Before running the test, we created between 0-3 relationships for each
   book.
   We have seeded 15,000 students ,1000 publishers and 15 million books.
   And the server is a 8GB RAM machine.
   I wonder why the traversal is drastically slow?
   Regards
   SDev
   ___
   Neo mailing list
   User@lists.neo4j.org
   [1]https://lists.neo4j.org/mailman/listinfo/user

References

   1. https://lists.neo4j.org/mailman/listinfo/user
___
Neo mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo] Traversal Speed is just 1 millisecond per node

2010-05-15 Thread Marko Rodriguez
Hi,

Adding onto Craig's thoughts, I'd like to point you to some related work in 
this area:

1. Modeling a library as a graph.
- slides: 
http://www.slideshare.net/slidarko/a-practical-ontology-for-the-largescale-modeling-of-scholarly-artifacts-and-their-usage-3879791
- article: http://arxiv.org/abs/0708.1150

2. Doing 'slightly complex queries' as graph traversal over graph databases 
such as Neo4j:
- software framework: http://pipes.tinkerpop.com
- pipes give you fine-grained control over your walker with 
good speed: http://bit.ly/aa29MO
- related article: http://arxiv.org/abs/0806.2274
- related article: http://arxiv.org/abs/1004.1001

Take care,
Marko.

http://tinkerpop.com
http://markorodriguez.com

On May 15, 2010, at 4:05 PM, Craig Taverner wrote:

> My 2 cents, without knowing the structure of your data (which is needed to
> really answer the question).
> 
> I assume when you say 'slightly complex query' you are probably using a
> custom traverser that looks at properties of nodes and/or relationships to
> make the decision, or possibly even follows a relationship to make the
> decision. All of these options will slow things down. Your original
> traverser probably only considered relationship types and directions,
> loading from only the relationships table. The new one hits the properties
> tables, possibly for both nodes and relationships.
> 
> If this is the case, the improvement is much the same as you would do in a
> relational database, which is to index the data. However, indexing is
> different in a graph, and I think the best way to do that in your case is to
> build additional graph structures that allow the new traverser to only look
> at relationships. For example, you say that you are interested in books from
> a particular published currently lent out. Consider having the publisher not
> have direct relationships to their books (a publisher index), but instead
> have relationships to 'borrowed' and 'not borrowed' nodes and those are
> related to the books (effectively a combined publisher-borrowing_status
> index). When a book is borrowed, move it's relationship. Since borrowing a
> book occurs occasionally over very long times (days or weeks), this database
> edit has no performance cost, but makes the query you are looking for very
> fast. To add a time period to this situation, consider the TimeLineIndex.
> Alternatively extend the previous concept to have nodes representing books
> borrowed on certain days, for example.
> 
> The real solution is really dependent on your data and the kinds of queries
> you plan to make. You probably already made the publisher-book relationships
> because you planned to make a query like that. The more complex queries you
> wish to make the more complex structure you will probably devise. Neo4j is
> great in that you can keep optimizing by adding appropriate structure
> without removing previous capabilities.
> 
> On Sat, May 15, 2010 at 11:34 PM, suryadev vasudev <
> suryadev.vasu...@gmail.com> wrote:
> 
>> We are considering Neo4J for a decision making application. The application
>> is analogous to a Library having 15 million books. We have BOOKS,
>> PUBLISHERS
>> and STUDENTS as nodes. Every book will have a PUBLISHED_BY relationship to
>> one publisher. STUDENTS may borrow a book, reserve a book or return a
>> borrowed book. Each is a relationship type meaning BORROWED_BY, RESERVED_BY
>> and  RETURNS between BOOKS and STUDENTS.
>> When we traverse starting from a publisher, the traversing speed is
>> 200-1000
>> nodes per millisecond. This is pure traversal to get a book count by
>> publisher.
>> The Neo is failing us when we make a slightly complex query.
>> Starting with a publisher, retrieve all books that are currently lent out.
>> Starting with a publisher, retrieve all books that were borrowed between
>> May
>> 1 2010 and May 10 2010.
>> The response time we got was 1-2 millisecond per book.
>> Before running the test, we created between 0-3 relationships for each
>> book.
>> We have seeded 15,000 students ,1000 publishers and 15 million books.
>> And the server is a 8GB RAM machine.
>> I wonder why the traversal is drastically slow?
>> Regards
>> SDev
>> ___
>> Neo mailing list
>> User@lists.neo4j.org
>> https://lists.neo4j.org/mailman/listinfo/user
>> 
> ___
> Neo mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user

___
Neo mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo] Traversal Speed is just 1 millisecond per node

2010-05-15 Thread Craig Taverner
My 2 cents, without knowing the structure of your data (which is needed to
really answer the question).

I assume when you say 'slightly complex query' you are probably using a
custom traverser that looks at properties of nodes and/or relationships to
make the decision, or possibly even follows a relationship to make the
decision. All of these options will slow things down. Your original
traverser probably only considered relationship types and directions,
loading from only the relationships table. The new one hits the properties
tables, possibly for both nodes and relationships.

If this is the case, the improvement is much the same as you would do in a
relational database, which is to index the data. However, indexing is
different in a graph, and I think the best way to do that in your case is to
build additional graph structures that allow the new traverser to only look
at relationships. For example, you say that you are interested in books from
a particular published currently lent out. Consider having the publisher not
have direct relationships to their books (a publisher index), but instead
have relationships to 'borrowed' and 'not borrowed' nodes and those are
related to the books (effectively a combined publisher-borrowing_status
index). When a book is borrowed, move it's relationship. Since borrowing a
book occurs occasionally over very long times (days or weeks), this database
edit has no performance cost, but makes the query you are looking for very
fast. To add a time period to this situation, consider the TimeLineIndex.
Alternatively extend the previous concept to have nodes representing books
borrowed on certain days, for example.

The real solution is really dependent on your data and the kinds of queries
you plan to make. You probably already made the publisher-book relationships
because you planned to make a query like that. The more complex queries you
wish to make the more complex structure you will probably devise. Neo4j is
great in that you can keep optimizing by adding appropriate structure
without removing previous capabilities.

On Sat, May 15, 2010 at 11:34 PM, suryadev vasudev <
suryadev.vasu...@gmail.com> wrote:

> We are considering Neo4J for a decision making application. The application
> is analogous to a Library having 15 million books. We have BOOKS,
> PUBLISHERS
> and STUDENTS as nodes. Every book will have a PUBLISHED_BY relationship to
> one publisher. STUDENTS may borrow a book, reserve a book or return a
> borrowed book. Each is a relationship type meaning BORROWED_BY, RESERVED_BY
> and  RETURNS between BOOKS and STUDENTS.
> When we traverse starting from a publisher, the traversing speed is
> 200-1000
> nodes per millisecond. This is pure traversal to get a book count by
> publisher.
> The Neo is failing us when we make a slightly complex query.
> Starting with a publisher, retrieve all books that are currently lent out.
> Starting with a publisher, retrieve all books that were borrowed between
> May
> 1 2010 and May 10 2010.
> The response time we got was 1-2 millisecond per book.
> Before running the test, we created between 0-3 relationships for each
> book.
> We have seeded 15,000 students ,1000 publishers and 15 million books.
> And the server is a 8GB RAM machine.
> I wonder why the traversal is drastically slow?
> Regards
> SDev
> ___
> Neo mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
>
___
Neo mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user