Hi,

We were evaluating ArangoDB for our use case and it looked very promising 
until we hit some blockers. We wanted to share them with the community and 
see if we could change anything about our approach.

System:
- ArangoDB 3.4.5 running as a Docker container
- RocksDB storage engine
- macOS, 16 GB RAM

Use case:
~775K user GUIDs that we wanted to bulk import. You can see some sample 
values below.

We stumbled upon 2 blockers:

1. arangoimp wasn't able to process a large JSON file and would get stuck. 
(The JSON file was identical to the jsonl file linked below, except that 
here we had an array of JSON objects on a single line.)

$ docker exec -i 9ecdb1b73004 arangoimp --server.password XXXXX \
    --file "dxids.json" --type json --collection events \
    --progress true --threads 4 --on-duplicate ignore

Connected to ArangoDB 'http+tcp://127.0.0.1:8529', version 3.4.5, database: '_system', username: 'root'

----------------------------------------
database:               _system
collection:             events
create:                 no
create database:        no
source filename:        dxids.json
file type:              json
threads:                4
connect timeout:        5
request timeout:        1200
----------------------------------------

Starting JSON import...

2019-06-05T09:05:40Z [321] ERROR error message(s):
2019-06-05T09:05:40Z [321] ERROR import file is too big. please increase the value of --batch-size (currently 1048576)

We kept getting the "please increase --batch-size" error. As we increased 
the batch size, the import started to process but would eventually get 
stuck at 99% (we left it running for 2-3 hours) without success. E.g.:

2019-06-05T09:06:36Z [375] INFO processed 34634719 bytes (93%) of input file
2019-06-05T09:06:36Z [375] INFO processed 35748797 bytes (96%) of input file
2019-06-05T09:06:36Z [375] INFO processed 36895642 bytes (99%) of input file
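
For reference, this is how raising the batch size looks on the command 
line; the 16 MB value below is only illustrative, not the exact value we 
ended up at:

$ docker exec -i 9ecdb1b73004 arangoimp --server.password XXXXX \
    --file "dxids.json" --type json --collection events \
    --progress true --threads 4 --on-duplicate ignore \
    --batch-size 16777216   # 16 MB instead of the default 1048576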


2. We changed the file to a jsonl representation. This time the import at 
least completes, but it takes close to 50-70 minutes to finish.
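
In case the conversion step is useful to anyone: assuming the original 
array file is called dxids.json, a jq one-liner like the following produces 
the jsonl file we import below (one compact JSON object per line):

$ jq -c '.[]' dxids.json > dxids-cleaned.json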

Here are some stats on our data:

$ wc -l dxids-cleaned.json
  775783 dxids-cleaned.json

$ head dxids-cleaned.json
{"_key":"ca7c1b92-962f-482b-8be1-d3888686aee9"}
{"_key":"a54432a0-15c8-46d2-8f67-21c928c385cf"}
{"_key":"c6aa3a49-0d56-4c31-b0f5-32ca88725fff"}
{"_key":"19a207fc-7fcb-4dee-9789-146d5fc7ed0a"}
{"_key":"08e9b852-c4fd-4ff1-83bb-9aaf6e7f837f"}
{"_key":"d6e88e54-cf1f-4566-9ffd-e43aeb3b6767"}
{"_key":"717a99d2-1985-4af1-ab09-4210324c1c83"}
{"_key":"a6377fc2-11bc-4d3c-9c54-ae4f12e7b439"}
{"_key":"a6249b90-a055-4f36-94c7-b16765c8d654"}
{"_key":"2261b38b-a75e-4e6d-b50e-9715a52c6e33"}

Source: https://testfkej2fb945.s3.amazonaws.com/dxids-cleaned.json.zip
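
Since the import runs with --on-duplicate ignore, a plain sort/uniq check 
like the one below can confirm whether the file actually contains any 
duplicate lines (and therefore duplicate _key values):

$ sort dxids-cleaned.json | uniq -d | wc -l   # prints 0 if every line is unique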

Here is how we are initiating the import:

$ docker exec -i f295a1638892 arangoimp --server.password XXXXX \
    --file "dxids-cleaned.json" --type jsonl --collection events \
    --progress true --threads 4 --on-duplicate ignore
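
One variation that comes to mind, in case a single client connection is the 
bottleneck, is splitting the jsonl file into chunks and running several 
imports in parallel. A rough sketch, with an arbitrary chunk prefix, 
assuming the chunks end up in the same (mounted) directory the container 
reads dxids-cleaned.json from:

$ split -l 200000 dxids-cleaned.json chunk_   # ~4 chunks of up to 200K lines each
$ for f in chunk_*; do
    docker exec -i f295a1638892 arangoimp --server.password XXXXX \
      --file "$f" --type jsonl --collection events --on-duplicate ignore &
  done; wait   # one import per chunk, all running concurrently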

Finally, I wanted to understand:

1. Can we tweak our approach in any way?
2. Is the 40-60 minute processing time within the expected range? We bulk 
ingested the same data into Neo4j and it took a few minutes. I'm simply 
curious, as we are doing this evaluation for our internal use case.

Best,
