> We have written a test harness to evaluate performance of a number of graph alternatives.
> The original snippet of code is not part of the harness, but was an example of how we are
> adding data through the java driver. Because we are currently using neo4j, that was the
> initial implementation for that harness. The test consists of adding 2M nodes using
> batches/transactions of 500. Tests are being run initially on dev laptops with the end goal
> of running all of the tests on a single, more powerful environment. With neo, we are able
> to add the 2M nodes in roughly 6-7 minutes. With Arangodb, we are in the neighborhood
> of 35-40 minutes for the same data so as you can see, this is a pretty dramatic difference.

Hi Rob,

there are several approaches that improve the performance considerably.

(1) Single Document Operation

The program you provided uses a single document operation for each vertex and edge that is inserted. Because it is based on the synchronous driver, nothing runs in parallel. To make better use of the server, you can use Java threads to create the vertices and edges in parallel. In Michele's example program we also raised the transaction size to 500. You can find Michele's version here: https://gist.github.com/rashtao/831c7e0281314789a2e2b57e8b3bfe67 (a minimal sketch of the threading pattern also follows at the end of this message). This is just a proof of concept, not production-quality code. With this setup we get on a laptop:

    1200000 vertexes
    800000 edges
    elapsed: 195275 ms
    10241 insertions/s

This is roughly 10x faster than your numbers. Obviously this is not the same test environment, but the laptop we used is not the fastest. The drawback of this approach is the large amount of communication between the client and the server. To make full use of batches, the following approach will help.

(2) Insert using AQL

You can use AQL to insert batches of vertices and edges. Michele's example program can be found here: https://gist.github.com/rashtao/5b72b6187d1a6b50aa129a9f3c5fb2ef (a sketch of the core query follows below as well). With this version we reach the following numbers (on the same laptop as above):

    1200000 vertexes
    800000 edges
    elapsed: 72617 ms
    27541 insertions/s

That is roughly 2.7x faster than the previous approach; with this setup you can import 2 million documents in 1min12sec. Please note that if you use much larger transaction sizes, you should enable intermediate commits, see https://www.arangodb.com/docs/3.6/transactions-limitations.html#rocksdb-storage-engine

(3) Batch Generation of Documents

We also provide a specialized API for inserting batches of documents. This can be used as an alternative to (2) and allows you to gain even more performance. For example, see https://gist.github.com/rashtao/22a43ba5233669d610eca65e06bc7b87 (sketch below). This gives:

    1200000 vertexes
    800000 edges
    elapsed: 64428 ms
    31042 insertions/s

This is slightly faster than using AQL.

(4) Import

There is also a special API for bulk imports (sketch below). However, it does not support transactions, see https://www.arangodb.com/docs/stable/http/bulk-imports.html. We can provide more details if required.

(5) Outlook

We are also working on the next version of the Java driver. It will be reactive and non-blocking on the network side, use fewer threads, and auto-tune the parallelism in the client. That will make approach (1) even easier to implement.
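As a rough illustration of (1): the sketch below is not Michele's gist, just a minimal version of the same pattern. The database name "test", the collection name "vertices", and the thread/document counts are placeholders; host and password follow the docker-compose file quoted further down. It spreads synchronous single-document inserts across a fixed thread pool so that several requests are in flight at once.

    import com.arangodb.ArangoCollection;
    import com.arangodb.ArangoDB;
    import com.arangodb.entity.BaseDocument;

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ParallelSingleInserts {
        public static void main(String[] args) throws InterruptedException {
            ArangoDB arango = new ArangoDB.Builder()
                    .host("127.0.0.1", 8529)
                    .user("root").password("rootpassword")
                    .maxConnections(8) // one connection per worker thread
                    .build();
            // "test" and "vertices" are placeholder names
            ArangoCollection vertices = arango.db("test").collection("vertices");

            int threads = 8;
            int docsPerThread = 250_000; // 8 x 250k = 2M documents
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int t = 0; t < threads; t++) {
                int offset = t * docsPerThread;
                pool.submit(() -> {
                    for (int i = 0; i < docsPerThread; i++) {
                        BaseDocument doc = new BaseDocument();
                        doc.addAttribute("value", offset + i);
                        vertices.insertDocument(doc); // one synchronous call per document
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            arango.shutdown();
        }
    }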
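The heart of (2) is a single AQL query that writes a whole batch per server round trip. A minimal sketch with the same placeholder names (one batch of 500 documents bound to the query; Michele's gist does the full 2M in a loop):

    import com.arangodb.ArangoDB;
    import com.arangodb.ArangoDatabase;

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class AqlBatchInsert {
        public static void main(String[] args) {
            ArangoDB arango = new ArangoDB.Builder()
                    .host("127.0.0.1", 8529)
                    .user("root").password("rootpassword")
                    .build();
            ArangoDatabase db = arango.db("test"); // placeholder database name

            List<Map<String, Object>> batch = new ArrayList<>();
            for (int i = 0; i < 500; i++) { // one batch of 500 documents
                Map<String, Object> doc = new HashMap<>();
                doc.put("value", i);
                batch.add(doc);
            }

            Map<String, Object> bindVars = new HashMap<>();
            bindVars.put("docs", batch);
            // the whole batch travels to the server in one query
            db.query("FOR d IN @docs INSERT d INTO vertices", bindVars, null, Void.class);

            arango.shutdown();
        }
    }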
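For (3), without opening the gist we cannot say exactly which call it uses, but the driver's batch document API takes a whole list in one call; a sketch assuming insertDocuments:

    import com.arangodb.ArangoCollection;
    import com.arangodb.ArangoDB;
    import com.arangodb.entity.BaseDocument;

    import java.util.ArrayList;
    import java.util.List;

    public class BatchDocumentInsert {
        public static void main(String[] args) {
            ArangoDB arango = new ArangoDB.Builder()
                    .host("127.0.0.1", 8529)
                    .user("root").password("rootpassword")
                    .build();
            ArangoCollection vertices = arango.db("test").collection("vertices");

            List<BaseDocument> batch = new ArrayList<>();
            for (int i = 0; i < 500; i++) {
                BaseDocument doc = new BaseDocument();
                doc.addAttribute("value", i);
                batch.add(doc);
            }
            // one driver call per batch instead of one per document
            vertices.insertDocuments(batch);

            arango.shutdown();
        }
    }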
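(4) looks almost the same from the driver side; importDocuments goes through the bulk import endpoint and reports per-batch counts, but the writes are not transactional. A fragment in the same placeholder setup:

    import com.arangodb.ArangoCollection;
    import com.arangodb.ArangoDB;
    import com.arangodb.entity.BaseDocument;
    import com.arangodb.entity.DocumentImportEntity;

    import java.util.ArrayList;
    import java.util.List;

    public class BulkImport {
        public static void main(String[] args) {
            ArangoDB arango = new ArangoDB.Builder()
                    .host("127.0.0.1", 8529)
                    .user("root").password("rootpassword")
                    .build();
            ArangoCollection vertices = arango.db("test").collection("vertices");

            List<BaseDocument> batch = new ArrayList<>();
            for (int i = 0; i < 500; i++) {
                BaseDocument doc = new BaseDocument();
                doc.addAttribute("value", i);
                batch.add(doc);
            }
            // bulk import: fast, but bypasses transactions
            DocumentImportEntity result = vertices.importDocuments(batch);
            System.out.println("created: " + result.getCreated()
                    + ", errors: " + result.getErrors());

            arango.shutdown();
        }
    }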
If you have any further questions please do not hesitate to ask. Alternatively, we can set up a call to discuss the various options.

Michele & Frank

On Monday, 24 February 2020 15:18:17 UTC+1, Frank Celler wrote:
>
> Hi Rob,
>
> thanks a lot for the details. We will create a similar test environment,
>
> best Frank
>
> On Monday, 24 February 2020 14:53:58 UTC+1, Rob Gratz wrote:
>>
>> We have written a test harness to evaluate performance of a number of
>> graph alternatives. The original snippet of code is not part of the
>> harness, but was an example of how we are adding data through the java
>> driver. Because we are currently using neo4j, that was the initial
>> implementation for that harness. The test consists of adding 2M nodes
>> using batches/transactions of 500. Tests are being run initially on dev
>> laptops with the end goal of running all of the tests on a single, more
>> powerful environment. With neo, we are able to add the 2M nodes in roughly
>> 6-7 minutes. With Arangodb, we are in the neighborhood of 35-40 minutes
>> for the same data so as you can see, this is a pretty dramatic difference.
>>
>> This is a single arangodb instance running in a docker container. Here
>> is the docker-compose file being used:
>>
>> version: '3.7'
>> services:
>>   arangodb_db_container:
>>     image: arangodb:latest
>>     environment:
>>       ARANGO_ROOT_PASSWORD: rootpassword
>>     ports:
>>       - 8529:8529
>>     volumes:
>>       - arangodb_data_container:/var/lib/arangodb3
>>       - arangodb_apps_data_container:/var/lib/arangodb3-apps
>>
>> volumes:
>>   arangodb_data_container:
>>   arangodb_apps_data_container:
>>
>> The data has 6 different node types with all of the types having between
>> 4-7 fields. All of the fields being added are indexed, with one field being
>> indexed as unique.
>>
>> On Wednesday, February 19, 2020 at 8:17:30 AM UTC-7, Ingo Friepoertner wrote:
>>>
>>> Hi Rob,
>>>
>>> can you please share some more details?
>>> Do you use a local deployment, single server or cluster? What are the
>>> numbers you get and for what amount of data?
