OK, I've finished all the short read queries and all the update queries. You can find them all in the GitHub repo below, which also includes tools for performance testing ArangoDB on the benchmark queries:

https://github.com/PlatformLab/ldbc-snb-impls/tree/master/snb-interactive-arangodb

And you can find the SF0001 dataset available for download below; it includes a script for loading the dataset into an ArangoDB cluster or single instance (you'll probably need to modify it for your own needs, e.g. login credentials or server locations):

https://www.dropbox.com/s/nplg3du0npzav7e/ldbc_snb_sf0001-arangodb-2019-07-13.tar.gz?dl=0
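If you end up adapting the loading to your own setup, the two building blocks with the Java driver are just the connection and batch import. A rough sketch (endpoint, credentials, database and collection names, and the sample edge are all placeholders here, not necessarily what the script itself does):

import java.util.Arrays;
import java.util.List;

import com.arangodb.ArangoDB;
import com.arangodb.ArangoDatabase;
import com.arangodb.entity.BaseEdgeDocument;
import com.arangodb.entity.CollectionType;
import com.arangodb.model.CollectionCreateOptions;

// Connect (placeholder endpoint/credentials) and pick the target database.
ArangoDB arango = new ArangoDB.Builder()
    .host("127.0.0.1", 8529)
    .user("root")
    .password("")
    .build();
ArangoDatabase db = arango.db("ldbc_snb_sf0001");

// Edge collections have to be created as such before edges are imported.
if (!db.collection("knows").exists()) {
  db.createCollection("knows",
      new CollectionCreateOptions().type(CollectionType.EDGES));
}

// Batch import; _from/_to are full document handles ("<collection>/<key>").
// A real load would push batches of thousands of documents per call.
BaseEdgeDocument edge = new BaseEdgeDocument("Person/933", "Person/1129");
edge.addAttribute("creationDate", 1347901562000L);  // sample value
List<BaseEdgeDocument> batch = Arrays.asList(edge);
db.collection("knows").importDocuments(batch);

The query handlers below get their ArangoDatabase handle from an ArangoDbConnectionState wrapper rather than building the connection inline.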
Complex queries are not implemented yet. Happy to take pull requests on those if anyone is up for the challenge.

Feel free to contact me if you need any help getting things set up.

Best,
Jonathan

On Thursday, July 11, 2019 at 9:24:05 PM UTC-7, Jonathan Ellithorpe wrote:
>
> Hi All,
>
> Hammered out two of the simple read queries from the benchmark. Thought I
> would share and ask for some early feedback, to make sure I'm not missing
> out on any obvious query performance optimizations. The graph schema for
> all of this is here (same as before):
>
> [image: ldbc_snb_schema.png]
>
> ShortQuery1:
>
> /**
>  * Given a start Person, retrieve their first name, last name, birthday,
>  * IP address, browser, and city of residence.[1]
>  */
> ...
> ArangoDatabase db =
>     ((ArangoDbConnectionState) dbConnectionState).getDatabase();
> String statement =
>       "WITH Place"
>     + " FOR p IN Person"
>     + "   FILTER p._key == @personId"
>     + "   FOR c IN 1..1 OUTBOUND p isLocatedIn"
>     + "     RETURN {"
>     + "       firstName: p.firstName,"
>     + "       lastName: p.lastName,"
>     + "       birthday: p.birthday,"
>     + "       locationIP: p.locationIP,"
>     + "       browserUsed: p.browserUsed,"
>     + "       cityId: c._key,"
>     + "       gender: p.gender,"
>     + "       creationDate: p.creationDate"
>     + "     }";
>
> ArangoCursor<BaseDocument> cursor = db.query(
>     statement,
>     new MapBuilder()
>         .put("personId", String.valueOf(operation.personId()))
>         .get(),
>     new AqlQueryOptions(),
>     BaseDocument.class);
>
> if (cursor.hasNext()) {
>   BaseDocument doc = cursor.next();
>
>   resultReporter.report(0,
>       new LdbcShortQuery1PersonProfileResult(
>           (String) doc.getAttribute("firstName"),
>           (String) doc.getAttribute("lastName"),
>           (Long) doc.getAttribute("birthday"),
>           (String) doc.getAttribute("locationIP"),
>           (String) doc.getAttribute("browserUsed"),
>           Long.decode((String) doc.getAttribute("cityId")),
>           (String) doc.getAttribute("gender"),
>           (Long) doc.getAttribute("creationDate")),
>       operation);
> } else {
>   resultReporter.report(0, null, operation);
> }
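>
> For spotting the obvious problems (full collection scans, indexes not being
> used), pulling the execution plan from the server is probably the quickest
> check. Roughly, with the Java driver (the person key is just a sample value,
> and the explain options are left at their defaults):
>
> import com.arangodb.entity.AqlExecutionExplainEntity;
> import com.arangodb.model.AqlQueryExplainOptions;
> import com.arangodb.util.MapBuilder;
>
> // Ask the server how it would execute the statement above, without running it.
> AqlExecutionExplainEntity explained = db.explainQuery(
>     statement,
>     new MapBuilder().put("personId", "933").get(),  // sample person key
>     new AqlQueryExplainOptions());
>
> // Print the node types of the chosen plan (e.g. IndexNode, TraversalNode).
> for (AqlExecutionExplainEntity.ExecutionNode node : explained.getPlan().getNodes()) {
>   System.out.println(node.getType());
> }
>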
> ShortQuery2:
>
> /**
>  * Given a start Person, retrieve the last 10 Messages (Posts or Comments)
>  * created by that user. For each Message, return that Message, the original
>  * Post in its conversation, and the author of that Post. If any of the
>  * Messages is a Post, then the original Post is the same Message, i.e.,
>  * that Message will appear twice in the result. Order results descending by
>  * message creation date, then descending by message identifier.[1]
>  */
> ...
> ArangoDatabase db =
>     ((ArangoDbConnectionState) dbConnectionState).getDatabase();
> String statement =
>       "WITH Comment, Post"
>     + " FOR person IN Person"
>     + "   FILTER person._key == @personId"
>     + "   FOR message IN 1..1 INBOUND person hasCreator"
>     + "     SORT message.creationDate DESC, message._key DESC"
>     + "     LIMIT @limit"
>     + "     FOR originalPost IN 0..1024 OUTBOUND message replyOf"
>     + "       FILTER IS_SAME_COLLECTION('Post', originalPost._id)"
>     + "       FOR originalPostAuthor IN 1..1 OUTBOUND originalPost hasCreator"
>     + "         RETURN {"
>     + "           messageId: message._key,"
>     + "           messageContent: message.content,"
>     + "           messageImageFile: message.imageFile,"
>     + "           messageCreationDate: message.creationDate,"
>     + "           originalPostId: originalPost._key,"
>     + "           originalPostAuthorId: originalPostAuthor._key,"
>     + "           originalPostAuthorFirstName: originalPostAuthor.firstName,"
>     + "           originalPostAuthorLastName: originalPostAuthor.lastName"
>     + "         }";
>
> ArangoCursor<BaseDocument> cursor = db.query(
>     statement,
>     new MapBuilder()
>         .put("personId", String.valueOf(operation.personId()))
>         .put("limit", Integer.valueOf(operation.limit()))
>         .get(),
>     new AqlQueryOptions(),
>     BaseDocument.class);
>
> List<LdbcShortQuery2PersonPostsResult> resultList = new ArrayList<>();
>
> while (cursor.hasNext()) {
>   BaseDocument doc = cursor.next();
>
>   // Fall back to the image file when the message has no text content.
>   String content = (String) doc.getAttribute("messageContent");
>   if (content == null) {
>     content = (String) doc.getAttribute("messageImageFile");
>   }
>
>   resultList.add(new LdbcShortQuery2PersonPostsResult(
>       Long.valueOf((String) doc.getAttribute("messageId")),
>       content,
>       (Long) doc.getAttribute("messageCreationDate"),
>       Long.valueOf((String) doc.getAttribute("originalPostId")),
>       Long.valueOf((String) doc.getAttribute("originalPostAuthorId")),
>       (String) doc.getAttribute("originalPostAuthorFirstName"),
>       (String) doc.getAttribute("originalPostAuthorLastName")));
> }
>
> resultReporter.report(0, resultList, operation);
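>
> For eyeballing the output during development, running the statement directly
> with a known person key and dumping the raw documents is enough, e.g. (sample
> values):
>
> ArangoCursor<BaseDocument> check = db.query(
>     statement,
>     new MapBuilder()
>         .put("personId", "933")   // sample person key
>         .put("limit", 10)
>         .get(),
>     new AqlQueryOptions(),
>     BaseDocument.class);
>
> while (check.hasNext()) {
>   System.out.println(check.next().getProperties());
> }
>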
>
> Thanks in advance!
>
> Best,
> Jonathan
>
> On Wednesday, July 10, 2019 at 1:14:53 PM UTC-7, Jonathan Ellithorpe wrote:
>>
>> Hi Jan,
>>
>> Yup, completely understand. I'll send you the details you asked about
>> over e-mail later today.
>>
>> Best,
>> Jonathan
>>
>> On Wednesday, July 10, 2019 at 9:55:34 AM UTC-7, jan.stuecke wrote:
>>>
>>> Hey Jonathan,
>>>
>>> OK, sounds very interesting! Super cool pre-work. That helps a lot.
>>>
>>> We would be happy to collaborate with you on this, but I have to check
>>> back with our graph specialists first. I don't want to promise anything
>>> and then find our guys fully booked with customer projects and product
>>> development.
>>>
>>> Happy to keep this thread alive and post potential updates here for
>>> everybody, but for the details we could switch to email. You can reach me
>>> via [email protected]. It would be great if you could send me the
>>> analytical queries and the number of documents per collection (persons,
>>> tags, etc.) in your 1 TB dataset. Then I can discuss it with our seniors
>>> over here.
>>>
>>> Best, Jan
>>>
>>> On Wed 10. Jul 2019 at 07:44, Jonathan Ellithorpe <[email protected]>
>>> wrote:
>>>
>>>> Hi Jan,
>>>>
>>>> [image: ldbc_snb_schema.png]
>>>>
>>>> Thanks for that explanation, that does help. I'm glad that got resolved
>>>> (I haven't seen that thread updated with the resolution yet).
>>>>
>>>> The LDBC Social Network Benchmark is actually more property graph
>>>> focused. I've included an image of the graph schema to illustrate.
>>>>
>>>> While the schema is relatively straightforward, the benchmark is fairly
>>>> comprehensive and challenging, comprising a total of 29 queries: 14
>>>> complex "analytical" type read-only queries, 7 simple read-only queries,
>>>> and 8 update queries that add people, posts, likes, and so on to the
>>>> graph.
>>>>
>>>> I have a working implementation for Neo4j (as well as for my own graph
>>>> database, which I've been working on as a research project) in the
>>>> following repo:
>>>>
>>>> https://github.com/PlatformLab/ldbc-snb-impls
>>>>
>>>> I just added a skeleton for an ArangoDB implementation. Since I'm not
>>>> familiar with AQL (I just started playing around with it today), I
>>>> estimate it would take me considerable time to complete a full
>>>> implementation. I may be able to flesh out the simpler short read
>>>> queries and updates in a couple of days, but the 14 "analytical" style
>>>> complex queries are where things get... well... complicated. The hard
>>>> part is making sure I'm doing the target database justice and have
>>>> written each query in the most performant way possible. Even with the
>>>> gracious help of the (amazing) developers at Apache TinkerPop (many
>>>> thanks to them), getting a Gremlin implementation just to pass
>>>> validation was about a man-month of work (including learning Gremlin),
>>>> and then another week or two on top of that to work out inefficiencies
>>>> in the query implementations.
>>>>
>>>> I would be happy to collaborate on this, as I've already been working
>>>> with this benchmark for quite a while and have datasets (up to 1 TB in
>>>> size) available for use, along with various tools and validation data
>>>> for testing. What I do not have, however, is the ArangoDB / AQL
>>>> expertise to produce the highest-performance complex query
>>>> implementations possible for ArangoDB (the simple read and update
>>>> queries are simple enough that I believe I can work those out fairly
>>>> easily).
>>>>
>>>> Cheers,
>>>> Jonathan
>>>>
>>>> On Tuesday, July 9, 2019 at 9:06:23 PM UTC-7, jan.stuecke wrote:
>>>>
>>>>> Hi Jonathan,
>>>>>
>>>>> this is Jan from ArangoDB.
>>>>>
>>>>> Thanks for the hint with the LDBC Benchmark. We will have a look at
>>>>> whether this is a suitable setup for ArangoDB. Quite often these
>>>>> benchmarks are focused on RDF stores, whereas the graph part of
>>>>> ArangoDB's multi-model offering follows a property graph model.
>>>>>
>>>>> I forwarded the reported bulk load question to our Java specialist.
>>>>> Hope he will find some time to assist here.
>>>>>
>>>>> Please note that the problem with the "very simple query" wasn't
>>>>> necessarily on ArangoDB's side and was solved by remodeling the data.
>>>>> The user was storing huge binaries in ArangoDB, which is possible, but
>>>>> it's recommended to store them in a way that allows fast queries on the
>>>>> metadata and only accesses the binary data when necessary. E.g. if you
>>>>> store pictures, PDFs, or similar blobs, we recommend storing the
>>>>> metadata in collection A and the actual blob in collection B if you
>>>>> want to store both in Arango. If you store everything in one big JSON
>>>>> document, a query against it has to access the whole document at
>>>>> runtime -> a lot of unneeded processing -> query runtime increases.
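>>>>>
>>>>> Purely as an illustration of that split with the Java driver (the
>>>>> collection and attribute names are made up):
>>>>>
>>>>> import com.arangodb.entity.BaseDocument;
>>>>>
>>>>> // (db is an ArangoDatabase handle for the target database.)
>>>>> String base64Payload = "...";  // placeholder for the encoded binary
>>>>>
>>>>> // Keep the heavy payload in its own collection...
>>>>> BaseDocument blob = new BaseDocument();
>>>>> blob.addAttribute("data", base64Payload);
>>>>> String blobId = db.collection("pictureBlobs").insertDocument(blob).getId();
>>>>>
>>>>> // ...and keep queries on the small metadata document; follow blobRef
>>>>> // only when the actual bytes are needed.
>>>>> BaseDocument meta = new BaseDocument();
>>>>> meta.addAttribute("fileName", "profile.jpg");
>>>>> meta.addAttribute("sizeBytes", 48211);
>>>>> meta.addAttribute("blobRef", blobId);
>>>>> db.collection("pictureMeta").insertDocument(meta);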
>>>>>
>>>>> The recommended way from our side for best performance in these cases
>>>>> is to store the metadata in ArangoDB and use a dedicated filesystem for
>>>>> your binary data.
>>>>>
>>>>> Hope that helped.
>>>>>
>>>>> Best, Jan
>>>>>
>>>>> On Tue 9. Jul 2019 at 17:06, Jonathan Ellithorpe <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hello All,
>>>>>>
>>>>>> Has anyone worked on an implementation of the LDBC Social Network
>>>>>> Benchmark for ArangoDB?
>>>>>>
>>>>>> I see some folks here evidently struggling with ArangoDB performance
>>>>>> on even very simple queries (e.g.
>>>>>> https://groups.google.com/forum/#!topic/arangodb/sIOQ1xzJSpc), as
>>>>>> well as with how to efficiently bulk load graph data (e.g.
>>>>>> https://groups.google.com/forum/#!topic/arangodb/4eI3fvUzDYg).
>>>>>>
>>>>>> An implementation of the above-mentioned benchmark should serve
>>>>>> nicely to show how to use ArangoDB and AQL performantly, including
>>>>>> the bulk loading of graph data, besides showing ArangoDB's
>>>>>> performance capabilities.
>>>>>>
>>>>>> Jonathan
