Hi,
We were evaluating ArangoDB for our use case and it looked very promising
until we hit some blockers, so I wanted to share them with the community and
see whether we could change anything about our approach.
System:
- Running ArangoDB 3.4.5 as a Docker instance
- Using the RocksDB storage engine
- macOS, 16 GB RAM
Use case:
~775K user GUIDs that we want to bulk import. Sample values are shown below.
We stumbled upon 2 blockers:
1. arangoimp wasn't able to process a large JSON file and would get stuck.
(The JSON file was identical to the jsonl file linked below, except that it
contained an array of JSON objects on a single line.)
$ docker exec -i 9ecdb1b73004 arangoimp --server.password XXXXX \
    --file "dxids.json" --type json --collection events \
    --progress true --threads 4 --on-duplicate ignore
Connected to ArangoDB 'http+tcp://127.0.0.1:8529', version 3.4.5, database:
'_system', username: 'root'
----------------------------------------
database: _system
collection: events
create: no
create database: no
source filename: dxids.json
file type: json
threads: 4
connect timeout: 5
request timeout: 1200
----------------------------------------
Starting JSON import...
2019-06-05T09:05:40Z [321] ERROR error message(s):
2019-06-05T09:05:40Z [321] ERROR import file is too big. please increase
the value of --batch-size (currently 1048576)
We kept getting an error asking us to increase the batch size. We kept
increasing --batch-size; the import would then start processing but
eventually get stuck at 99% (we left it running for 2-3 hours) without
success. For example:
2019-06-05T09:06:36Z [375] INFO processed 34634719 bytes (93%) of input file
2019-06-05T09:06:36Z [375] INFO processed 35748797 bytes (96%) of input file
2019-06-05T09:06:36Z [375] INFO processed 36895642 bytes (99%) of input file
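For reference, this is roughly the invocation we ended up with after bumping
the batch size (the 16 MB value is just one of the values we tried, not a
recommendation):

$ docker exec -i 9ecdb1b73004 arangoimp --server.password XXXXX \
    --file "dxids.json" --type json --collection events \
    --progress true --threads 4 --on-duplicate ignore \
    --batch-size 16777216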
2. We converted the file to a jsonl representation (one way to do the
conversion is sketched below). This time the import at least completes, but
it takes close to 50-70 minutes to finish.
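For anyone following along, the conversion can be done with a jq one-liner
along these lines, assuming the original file is a single top-level JSON
array (jq emits one compact object per line):

$ jq -c '.[]' dxids.json > dxids-cleaned.json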
Here are some stats on our data:
$ wc -l dxids-cleaned.json
775783 dxids-cleaned.json
$ head dxids-cleaned.json
{"_key":"ca7c1b92-962f-482b-8be1-d3888686aee9"}
{"_key":"a54432a0-15c8-46d2-8f67-21c928c385cf"}
{"_key":"c6aa3a49-0d56-4c31-b0f5-32ca88725fff"}
{"_key":"19a207fc-7fcb-4dee-9789-146d5fc7ed0a"}
{"_key":"08e9b852-c4fd-4ff1-83bb-9aaf6e7f837f"}
{"_key":"d6e88e54-cf1f-4566-9ffd-e43aeb3b6767"}
{"_key":"717a99d2-1985-4af1-ab09-4210324c1c83"}
{"_key":"a6377fc2-11bc-4d3c-9c54-ae4f12e7b439"}
{"_key":"a6249b90-a055-4f36-94c7-b16765c8d654"}
{"_key":"2261b38b-a75e-4e6d-b50e-9715a52c6e33"}
Source: https://testfkej2fb945.s3.amazonaws.com/dxids-cleaned.json.zip
Here is how we are initiating the import:
$ docker exec -i f295a1638892 arangoimp --server.password XXXXX \
    --file "dxids-cleaned.json" --type jsonl --collection events \
    --progress true --threads 4 --on-duplicate ignore
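One tweak we have been considering but have not tried yet is to split the
JSONL file into chunks and POST them in parallel straight to the bulk import
HTTP API (/_api/import). A rough, untested sketch follows; the chunk size and
credentials are placeholders, and it assumes port 8529 is published to the
host:

$ split -l 100000 dxids-cleaned.json chunk_
$ for f in chunk_*; do
    curl -s -u root:XXXXX -X POST \
      "http://127.0.0.1:8529/_api/import?type=documents&collection=events&onDuplicate=ignore" \
      --data-binary @"$f" > /dev/null &
  done; wait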
Finally, I wanted to understand:
1. Can we tweak our approach?
2. Is the 40-60 minutes it takes to process within the expected range? We
bulk ingested the same data into Neo4j and it took a few minutes. I'm simply
curious, as we are doing this evaluation for an internal use case.
Best,