OK, I've finished all the short read queries and all the update queries. You can find them all in the GitHub repo below, which also includes tools for performance testing ArangoDB on the benchmark queries:

https://github.com/PlatformLab/ldbc-snb-impls/tree/master/snb-interactive-arangodb

And you can find the SF0001 dataset available for download below; it includes a script for loading the dataset into an ArangoDB cluster or single instance (you'll probably need to modify it for your own needs, e.g. login credentials or server locations):

https://www.dropbox.com/s/nplg3du0npzav7e/ldbc_snb_sf0001-arangodb-2019-07-13.tar.gz?dl=0
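If you end up adapting the loading to your own setup, the two building blocks with the Java driver are just the connection and batch import. A rough sketch (endpoint, credentials, database and collection names, and the sample edge are all placeholders here, not necessarily what the script itself does):

import java.util.Arrays;
import java.util.List;

import com.arangodb.ArangoDB;
import com.arangodb.ArangoDatabase;
import com.arangodb.entity.BaseEdgeDocument;
import com.arangodb.entity.CollectionType;
import com.arangodb.model.CollectionCreateOptions;

// Connect (placeholder endpoint/credentials) and pick the target database.
ArangoDB arango = new ArangoDB.Builder()
    .host("127.0.0.1", 8529)
    .user("root")
    .password("")
    .build();
ArangoDatabase db = arango.db("ldbc_snb_sf0001");

// Edge collections have to be created as such before edges are imported.
if (!db.collection("knows").exists()) {
  db.createCollection("knows",
      new CollectionCreateOptions().type(CollectionType.EDGES));
}

// Batch import; _from/_to are full document handles ("<collection>/<key>").
// A real load would push batches of thousands of documents per call.
BaseEdgeDocument edge = new BaseEdgeDocument("Person/933", "Person/1129");
edge.addAttribute("creationDate", 1347901562000L);  // sample value
List<BaseEdgeDocument> batch = Arrays.asList(edge);
db.collection("knows").importDocuments(batch);

The query handlers below get their ArangoDatabase handle from an ArangoDbConnectionState wrapper rather than building the connection inline.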
Complex queries are not implemented yet. Happy to take pull requests on those if anyone is up for the challenge.

Feel free to contact me if you need any help getting things set up.

Best,
Jonathan

On Thursday, July 11, 2019 at 9:24:05 PM UTC-7, Jonathan Ellithorpe wrote:
>
> Hi All,
>
> Hammered out two of the simple read queries from the benchmark. Thought I
> would share and ask for some early feedback, to make sure I'm not missing
> out on any obvious query performance optimizations. The graph schema for
> all of this is here (same as before):
>
> [image: ldbc_snb_schema.png]
>
> ShortQuery1:
>
> /**
>  * Given a start Person, retrieve their first name, last name, birthday,
>  * IP address, browser, and city of residence.[1]
>  */
> ...
> ArangoDatabase db =
>     ((ArangoDbConnectionState) dbConnectionState).getDatabase();
> String statement =
>       "WITH Place"
>     + " FOR p IN Person"
>     + "   FILTER p._key == @personId"
>     + "   FOR c IN 1..1 OUTBOUND p isLocatedIn"
>     + "     RETURN {"
>     + "       firstName: p.firstName,"
>     + "       lastName: p.lastName,"
>     + "       birthday: p.birthday,"
>     + "       locationIP: p.locationIP,"
>     + "       browserUsed: p.browserUsed,"
>     + "       cityId: c._key,"
>     + "       gender: p.gender,"
>     + "       creationDate: p.creationDate"
>     + "     }";
>
> ArangoCursor<BaseDocument> cursor = db.query(
>     statement,
>     new MapBuilder()
>         .put("personId", String.valueOf(operation.personId()))
>         .get(),
>     new AqlQueryOptions(),
>     BaseDocument.class);
>
> if (cursor.hasNext()) {
>   BaseDocument doc = cursor.next();
>
>   resultReporter.report(0,
>       new LdbcShortQuery1PersonProfileResult(
>           (String) doc.getAttribute("firstName"),
>           (String) doc.getAttribute("lastName"),
>           (Long) doc.getAttribute("birthday"),
>           (String) doc.getAttribute("locationIP"),
>           (String) doc.getAttribute("browserUsed"),
>           Long.decode((String) doc.getAttribute("cityId")),
>           (String) doc.getAttribute("gender"),
>           (Long) doc.getAttribute("creationDate")),
>       operation);
> } else {
>   resultReporter.report(0, null, operation);
> }
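>
> For spotting the obvious problems (full collection scans, indexes not being
> used), pulling the execution plan from the server is probably the quickest
> check. Roughly, with the Java driver (the person key is just a sample value,
> and the explain options are left at their defaults):
>
> import com.arangodb.entity.AqlExecutionExplainEntity;
> import com.arangodb.model.AqlQueryExplainOptions;
> import com.arangodb.util.MapBuilder;
>
> // Ask the server how it would execute the statement above, without running it.
> AqlExecutionExplainEntity explained = db.explainQuery(
>     statement,
>     new MapBuilder().put("personId", "933").get(),  // sample person key
>     new AqlQueryExplainOptions());
>
> // Print the node types of the chosen plan (e.g. IndexNode, TraversalNode).
> for (AqlExecutionExplainEntity.ExecutionNode node : explained.getPlan().getNodes()) {
>   System.out.println(node.getType());
> }
>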
> ShortQuery2:
>
> /**
>  * Given a start Person, retrieve the last 10 Messages (Posts or Comments)
>  * created by that user. For each Message, return that Message, the original
>  * Post in its conversation, and the author of that Post. If any of the
>  * Messages is a Post, then the original Post is the same Message, i.e.,
>  * that Message will appear twice in the result. Order results descending by
>  * message creation date, then descending by message identifier.[1]
>  */
> ...
> ArangoDatabase db =
>     ((ArangoDbConnectionState) dbConnectionState).getDatabase();
> String statement =
>       "WITH Comment, Post"
>     + " FOR person IN Person"
>     + "   FILTER person._key == @personId"
>     + "   FOR message IN 1..1 INBOUND person hasCreator"
>     + "     SORT message.creationDate DESC, message._key DESC"
>     + "     LIMIT @limit"
>     + "     FOR originalPost IN 0..1024 OUTBOUND message replyOf"
>     + "       FILTER IS_SAME_COLLECTION('Post', originalPost._id)"
>     + "       FOR originalPostAuthor IN 1..1 OUTBOUND originalPost hasCreator"
>     + "         RETURN {"
>     + "           messageId: message._key,"
>     + "           messageContent: message.content,"
>     + "           messageImageFile: message.imageFile,"
>     + "           messageCreationDate: message.creationDate,"
>     + "           originalPostId: originalPost._key,"
>     + "           originalPostAuthorId: originalPostAuthor._key,"
>     + "           originalPostAuthorFirstName: originalPostAuthor.firstName,"
>     + "           originalPostAuthorLastName: originalPostAuthor.lastName"
>     + "         }";
>
> ArangoCursor<BaseDocument> cursor = db.query(
>     statement,
>     new MapBuilder()
>         .put("personId", String.valueOf(operation.personId()))
>         .put("limit", Integer.valueOf(operation.limit()))
>         .get(),
>     new AqlQueryOptions(),
>     BaseDocument.class);
>
> List<LdbcShortQuery2PersonPostsResult> resultList = new ArrayList<>();
>
> while (cursor.hasNext()) {
>   BaseDocument doc = cursor.next();
>
>   // Fall back to the image file when the message has no text content.
>   String content = (String) doc.getAttribute("messageContent");
>   if (content == null) {
>     content = (String) doc.getAttribute("messageImageFile");
>   }
>
>   resultList.add(new LdbcShortQuery2PersonPostsResult(
>       Long.valueOf((String) doc.getAttribute("messageId")),
>       content,
>       (Long) doc.getAttribute("messageCreationDate"),
>       Long.valueOf((String) doc.getAttribute("originalPostId")),
>       Long.valueOf((String) doc.getAttribute("originalPostAuthorId")),
>       (String) doc.getAttribute("originalPostAuthorFirstName"),
>       (String) doc.getAttribute("originalPostAuthorLastName")));
> }
>
> resultReporter.report(0, resultList, operation);
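>
> For eyeballing the output during development, running the statement directly
> with a known person key and dumping the raw documents is enough, e.g. (sample
> values):
>
> ArangoCursor<BaseDocument> check = db.query(
>     statement,
>     new MapBuilder()
>         .put("personId", "933")   // sample person key
>         .put("limit", 10)
>         .get(),
>     new AqlQueryOptions(),
>     BaseDocument.class);
>
> while (check.hasNext()) {
>   System.out.println(check.next().getProperties());
> }
>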
>
> Thanks in advance!
>
> Best,
> Jonathan
>
> On Wednesday, July 10, 2019 at 1:14:53 PM UTC-7, Jonathan Ellithorpe wrote:
>>
>> Hi Jan,
>>
>> Yup, completely understand. I'll send you the details you asked about
>> over e-mail later today.
>>
>> Best,
>> Jonathan
>>
>> On Wednesday, July 10, 2019 at 9:55:34 AM UTC-7, jan.stuecke wrote:
>>>
>>> Hey Jonathan,
>>>
>>> OK, sounds very interesting! Super cool pre-work. That helps a lot.
>>>
>>> We would be happy to collaborate with you on this, but I have to check
>>> back with our graph specialists first. I don't want to promise anything
>>> and then find our guys fully booked with customer projects and product
>>> development.
>>>
>>> Happy to keep this thread alive and post potential updates here for
>>> everybody, but for the details we could switch to email. You can reach me
>>> via [email protected]. It would be great if you could send me the
>>> analytical queries and the number of documents per collection (persons,
>>> tags, etc.) in your 1 TB dataset. Then I can discuss it with our seniors
>>> over here.
>>>
>>> Best, Jan
>>>
>>> On Wed 10. Jul 2019 at 07:44, Jonathan Ellithorpe <[email protected]>
>>> wrote:
>>>
>>>> Hi Jan,
>>>>
>>>> [image: ldbc_snb_schema.png]
>>>>
>>>> Thanks for that explanation, that does help. I'm glad that got resolved
>>>> (I haven't seen that thread updated with the resolution yet).
>>>>
>>>> The LDBC Social Network Benchmark is actually more property graph
>>>> focused. I've included an image of the graph schema to illustrate.
>>>>
>>>> While the schema is relatively straightforward, the benchmark is fairly
>>>> comprehensive and challenging, comprising a total of 29 queries: 14
>>>> complex "analytical" type read-only queries, 7 simple read-only queries,
>>>> and 8 update queries that add people, posts, likes, and so on to the
>>>> graph.
>>>>
>>>> I have a working implementation for Neo4j (as well as for my own graph
>>>> database, which I've been working on as a research project) in the
>>>> following repo:
>>>>
>>>> https://github.com/PlatformLab/ldbc-snb-impls
>>>>
>>>> I just added a skeleton for an ArangoDB implementation. Since I'm not
>>>> familiar with AQL (I just started playing around with it today), I
>>>> estimate it would take me considerable time to complete a full
>>>> implementation. I may be able to flesh out the simpler short read
>>>> queries and updates in a couple of days, but the 14 "analytical" style
>>>> complex queries are where things get... well... complicated. The hard
>>>> part is making sure I'm doing the target database justice and have
>>>> written each query in the most performant way possible. Even with the
>>>> gracious help of the (amazing) developers at Apache TinkerPop (many
>>>> thanks to them), getting a Gremlin implementation just to pass
>>>> validation was about a man-month of work (including learning Gremlin),
>>>> and then another week or two on top of that to work out inefficiencies
>>>> in the query implementations.
>>>>
>>>> I would be happy to collaborate on this, as I've already been working
>>>> with this benchmark for quite a while and have datasets (up to 1 TB in
>>>> size) available for use, along with various tools and validation data
>>>> for testing. What I do not have, however, is the ArangoDB / AQL
>>>> expertise to produce the highest-performance complex query
>>>> implementations possible for ArangoDB (the simple read and update
>>>> queries are simple enough that I believe I can work those out fairly
>>>> easily).
>>>>
>>>> Cheers,
>>>> Jonathan
>>>>
>>>> On Tuesday, July 9, 2019 at 9:06:23 PM UTC-7, jan.stuecke wrote:
>>>>
>>>>> Hi Jonathan,
>>>>>
>>>>> this is Jan from ArangoDB.
>>>>>
>>>>> Thanks for the hint with the LDBC Benchmark. We will have a look at
>>>>> whether this is a suitable setup for ArangoDB. Quite often these
>>>>> benchmarks are focused on RDF stores, whereas the graph part of
>>>>> ArangoDB's multi-model offering follows a property graph model.
>>>>>
>>>>> I forwarded the reported bulk load question to our Java specialist.
>>>>> Hope he will find some time to assist here.
>>>>>
>>>>> Please note that the problem with the "very simple query" wasn't
>>>>> necessarily on ArangoDB's side and was solved by remodeling the data.
>>>>> The user was storing huge binaries in ArangoDB, which is possible, but
>>>>> it's recommended to store them in a way that allows fast queries on the
>>>>> metadata and only accesses the binary data when necessary. E.g. if you
>>>>> store pictures, PDFs, or similar blobs, we recommend storing the
>>>>> metadata in collection A and the actual blob in collection B if you
>>>>> want to store both in Arango. If you store everything in one big JSON
>>>>> document, a query against it has to access the whole document at
>>>>> runtime -> a lot of unneeded processing -> query runtime increases.
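>>>>>
>>>>> Purely as an illustration of that split with the Java driver (the
>>>>> collection and attribute names are made up):
>>>>>
>>>>> import com.arangodb.entity.BaseDocument;
>>>>>
>>>>> // (db is an ArangoDatabase handle for the target database.)
>>>>> String base64Payload = "...";  // placeholder for the encoded binary
>>>>>
>>>>> // Keep the heavy payload in its own collection...
>>>>> BaseDocument blob = new BaseDocument();
>>>>> blob.addAttribute("data", base64Payload);
>>>>> String blobId = db.collection("pictureBlobs").insertDocument(blob).getId();
>>>>>
>>>>> // ...and keep queries on the small metadata document; follow blobRef
>>>>> // only when the actual bytes are needed.
>>>>> BaseDocument meta = new BaseDocument();
>>>>> meta.addAttribute("fileName", "profile.jpg");
>>>>> meta.addAttribute("sizeBytes", 48211);
>>>>> meta.addAttribute("blobRef", blobId);
>>>>> db.collection("pictureMeta").insertDocument(meta);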
>>>>>
>>>>> The recommended way from our side for best performance in these cases
>>>>> is to store the metadata in ArangoDB and use a dedicated filesystem for
>>>>> your binary data.
>>>>>
>>>>> Hope that helped.
>>>>>
>>>>> Best, Jan
>>>>>
>>>>> On Tue 9. Jul 2019 at 17:06, Jonathan Ellithorpe <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hello All,
>>>>>>
>>>>>> Has anyone worked on an implementation of the LDBC Social Network
>>>>>> Benchmark for ArangoDB?
>>>>>>
>>>>>> I see some folks here evidently struggling with ArangoDB performance
>>>>>> on even very simple queries (e.g.
>>>>>> https://groups.google.com/forum/#!topic/arangodb/sIOQ1xzJSpc), as
>>>>>> well as with how to efficiently bulk load graph data (e.g.
>>>>>> https://groups.google.com/forum/#!topic/arangodb/4eI3fvUzDYg).
>>>>>>
>>>>>> An implementation of the above-mentioned benchmark should serve
>>>>>> nicely to show how to use ArangoDB and AQL performantly, including
>>>>>> the bulk loading of graph data, besides showing ArangoDB's
>>>>>> performance capabilities.
>>>>>>
>>>>>> Jonathan
