> We have written a test harness to evaluate performance of a number of graph alternatives.
> The original snippet of code is not part of the harness, but was an example of how we are
> adding data through the java driver. Because we are currently using neo4j, that was the
> initial implementation for that harness. The test consists of adding 2M nodes using
> batches/transactions of 500. Tests are being run initially on dev laptops with the end goal
> of running all of the tests on a single, more powerful environment. With neo, we are able
> to add the 2M nodes in roughly 6-7 minutes. With Arangodb, we are in the neighborhood
> of 35-40 minutes for the same data so as you can see, this is a pretty dramatic difference.

Hi Rob,

there are several approaches that improve the performance considerably.

(1) Single Document Operation

The program you provided uses a single document operation for each vertex and edge that is inserted. Because it is based on the synchronous driver, nothing runs in parallel. To make better use of the server, you can use Java threads to create the vertices and edges in parallel. In Michele's example program we also raised the transaction size to 500. You can find Michele's version here: https://gist.github.com/rashtao/831c7e0281314789a2e2b57e8b3bfe67 (a minimal sketch of the threading pattern also follows at the end of this message). This is just a proof of concept, not production-quality code. With this setup we get on a laptop:

    1200000 vertexes
    800000 edges
    elapsed: 195275 ms
    10241 insertions/s

This is roughly 10x faster than your numbers. Obviously this is not the same test environment, but the laptop we used is not the fastest. The drawback of this approach is the large amount of communication between the client and the server. To make full use of batches, the following approach will help.

(2) Insert using AQL

You can use AQL to insert batches of vertices and edges. Michele's example program can be found here: https://gist.github.com/rashtao/5b72b6187d1a6b50aa129a9f3c5fb2ef (a sketch of the core query follows below as well). With this version we reach the following numbers (on the same laptop as above):

    1200000 vertexes
    800000 edges
    elapsed: 72617 ms
    27541 insertions/s

That is roughly 2.7x faster than the previous approach; with this setup you can import 2 million documents in 1min12sec. Please note that if you use much larger transaction sizes, you should enable intermediate commits, see https://www.arangodb.com/docs/3.6/transactions-limitations.html#rocksdb-storage-engine

(3) Batch Generation of Documents

We also provide a specialized API for inserting batches of documents. This can be used as an alternative to (2) and allows you to gain even more performance. For example, see https://gist.github.com/rashtao/22a43ba5233669d610eca65e06bc7b87 (sketch below). This gives:

    1200000 vertexes
    800000 edges
    elapsed: 64428 ms
    31042 insertions/s

This is slightly faster than using AQL.

(4) Import

There is also a special API for bulk imports (sketch below). However, it does not support transactions, see https://www.arangodb.com/docs/stable/http/bulk-imports.html. We can provide more details if required.

(5) Outlook

We are also working on the next version of the Java driver. It will be reactive and non-blocking on the network side, use fewer threads, and auto-tune the parallelism in the client. That will make approach (1) even easier to implement.
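As a rough illustration of (1): the sketch below is not Michele's gist, just a minimal version of the same pattern. The database name "test", the collection name "vertices", and the thread/document counts are placeholders; host and password follow the docker-compose file quoted further down. It spreads synchronous single-document inserts across a fixed thread pool so that several requests are in flight at once.

    import com.arangodb.ArangoCollection;
    import com.arangodb.ArangoDB;
    import com.arangodb.entity.BaseDocument;

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ParallelSingleInserts {
        public static void main(String[] args) throws InterruptedException {
            ArangoDB arango = new ArangoDB.Builder()
                    .host("127.0.0.1", 8529)
                    .user("root").password("rootpassword")
                    .maxConnections(8) // one connection per worker thread
                    .build();
            // "test" and "vertices" are placeholder names
            ArangoCollection vertices = arango.db("test").collection("vertices");

            int threads = 8;
            int docsPerThread = 250_000; // 8 x 250k = 2M documents
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int t = 0; t < threads; t++) {
                int offset = t * docsPerThread;
                pool.submit(() -> {
                    for (int i = 0; i < docsPerThread; i++) {
                        BaseDocument doc = new BaseDocument();
                        doc.addAttribute("value", offset + i);
                        vertices.insertDocument(doc); // one synchronous call per document
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            arango.shutdown();
        }
    }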
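The heart of (2) is a single AQL query that writes a whole batch per server round trip. A minimal sketch with the same placeholder names (one batch of 500 documents bound to the query; Michele's gist does the full 2M in a loop):

    import com.arangodb.ArangoDB;
    import com.arangodb.ArangoDatabase;

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class AqlBatchInsert {
        public static void main(String[] args) {
            ArangoDB arango = new ArangoDB.Builder()
                    .host("127.0.0.1", 8529)
                    .user("root").password("rootpassword")
                    .build();
            ArangoDatabase db = arango.db("test"); // placeholder database name

            List<Map<String, Object>> batch = new ArrayList<>();
            for (int i = 0; i < 500; i++) { // one batch of 500 documents
                Map<String, Object> doc = new HashMap<>();
                doc.put("value", i);
                batch.add(doc);
            }

            Map<String, Object> bindVars = new HashMap<>();
            bindVars.put("docs", batch);
            // the whole batch travels to the server in one query
            db.query("FOR d IN @docs INSERT d INTO vertices", bindVars, null, Void.class);

            arango.shutdown();
        }
    }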
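For (3), without opening the gist we cannot say exactly which call it uses, but the driver's batch document API takes a whole list in one call; a sketch assuming insertDocuments:

    import com.arangodb.ArangoCollection;
    import com.arangodb.ArangoDB;
    import com.arangodb.entity.BaseDocument;

    import java.util.ArrayList;
    import java.util.List;

    public class BatchDocumentInsert {
        public static void main(String[] args) {
            ArangoDB arango = new ArangoDB.Builder()
                    .host("127.0.0.1", 8529)
                    .user("root").password("rootpassword")
                    .build();
            ArangoCollection vertices = arango.db("test").collection("vertices");

            List<BaseDocument> batch = new ArrayList<>();
            for (int i = 0; i < 500; i++) {
                BaseDocument doc = new BaseDocument();
                doc.addAttribute("value", i);
                batch.add(doc);
            }
            // one driver call per batch instead of one per document
            vertices.insertDocuments(batch);

            arango.shutdown();
        }
    }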
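(4) looks almost the same from the driver side; importDocuments goes through the bulk import endpoint and reports per-batch counts, but the writes are not transactional. A fragment in the same placeholder setup:

    import com.arangodb.ArangoCollection;
    import com.arangodb.ArangoDB;
    import com.arangodb.entity.BaseDocument;
    import com.arangodb.entity.DocumentImportEntity;

    import java.util.ArrayList;
    import java.util.List;

    public class BulkImport {
        public static void main(String[] args) {
            ArangoDB arango = new ArangoDB.Builder()
                    .host("127.0.0.1", 8529)
                    .user("root").password("rootpassword")
                    .build();
            ArangoCollection vertices = arango.db("test").collection("vertices");

            List<BaseDocument> batch = new ArrayList<>();
            for (int i = 0; i < 500; i++) {
                BaseDocument doc = new BaseDocument();
                doc.addAttribute("value", i);
                batch.add(doc);
            }
            // bulk import: fast, but bypasses transactions
            DocumentImportEntity result = vertices.importDocuments(batch);
            System.out.println("created: " + result.getCreated()
                    + ", errors: " + result.getErrors());

            arango.shutdown();
        }
    }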
If you have any further questions please do not hesitate to ask. Alternatively, we can set up a call to discuss the various options.

Michele & Frank

On Monday, 24 February 2020 15:18:17 UTC+1, Frank Celler wrote:
>
> Hi Rob,
>
> thanks a lot for the details. We will create a similar test environment,
>
> best Frank
>
> On Monday, 24 February 2020 14:53:58 UTC+1, Rob Gratz wrote:
>>
>> We have written a test harness to evaluate performance of a number of
>> graph alternatives. The original snippet of code is not part of the
>> harness, but was an example of how we are adding data through the java
>> driver. Because we are currently using neo4j, that was the initial
>> implementation for that harness. The test consists of adding 2M nodes
>> using batches/transactions of 500. Tests are being run initially on dev
>> laptops with the end goal of running all of the tests on a single, more
>> powerful environment. With neo, we are able to add the 2M nodes in roughly
>> 6-7 minutes. With Arangodb, we are in the neighborhood of 35-40 minutes
>> for the same data so as you can see, this is a pretty dramatic difference.
>>
>> This is a single arangodb instance running in a docker container. Here
>> is the docker-compose file being used:
>>
>> version: '3.7'
>> services:
>>   arangodb_db_container:
>>     image: arangodb:latest
>>     environment:
>>       ARANGO_ROOT_PASSWORD: rootpassword
>>     ports:
>>       - 8529:8529
>>     volumes:
>>       - arangodb_data_container:/var/lib/arangodb3
>>       - arangodb_apps_data_container:/var/lib/arangodb3-apps
>>
>> volumes:
>>   arangodb_data_container:
>>   arangodb_apps_data_container:
>>
>> The data has 6 different node types with all of the types having between
>> 4-7 fields. All of the fields being added are indexed, with one field being
>> indexed as unique.
>>
>> On Wednesday, February 19, 2020 at 8:17:30 AM UTC-7, Ingo Friepoertner wrote:
>>>
>>> Hi Rob,
>>>
>>> can you please share some more details?
>>> Do you use a local deployment, single server or cluster? What are the
>>> numbers you get and for what amount of data?
