I had a quick look (since we're also evaluating ArangoDB at the moment).
It looks like your import is slow due to the large number of duplicate IDs in
your dataset (there are only 50000 unique IDs among the 775783 lines in the
file). Filtering out duplicates before importing would definitely help;
failing that, sorting the input also helps. I ran this locally with 2 threads:
$ sort dxids-cleaned.json > sorted-dxids-cleaned.json
$ time arangoimp --file sorted-dxids-cleaned.json --type jsonl \
    --collection dxid --progress true --server.authentication false \
    --create-collection true --overwrite true --on-duplicate ignore
created: 50000
warnings/errors: 394
updated/replaced: 0
ignored: 725389
real 1m 52.74s
user 0m 0.12s
sys 0m 0.19s
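
If you want to drop the duplicates up front instead, a plain sort -u should
do the trick, since every record is a standalone one-line JSON object and the
duplicates are byte-identical lines. Something like this (untested against
your exact file, and the output filename is just an example):

$ sort -u dxids-cleaned.json > unique-dxids-cleaned.json
$ wc -l unique-dxids-cleaned.json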
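
Regarding your first blocker: with --type json, arangoimp seems to need the
whole top-level array in a single batch, which would explain why bumping
--batch-size only got you so far. Converting the array file to jsonl up front
avoids that. If you have jq available, something along these lines should
work (untested, and it assumes dxids.json is one top-level array of objects):

$ jq -c '.[]' dxids.json > dxids.jsonl
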
On Wednesday, 5 June 2019 10:31:13 UTC+1, Akshay Surve wrote:
>
> Hi,
>
> We were evaluating ArangoDB for our use case and it looked very promising
> till we hit some blockers, so we wanted to share them with the community and
> see if we could possibly change anything about our approach.
>
> System:
> - Running ArangoDB 3.4.5 as a Docker instance
> - Using the RocksDB storage engine
> - macOS, 16GB RAM
>
> Use case:
> ~775K user GUIDs, which we wanted to bulk import. You can see some sample
> values below.
>
> We stumbled upon 2 blockers:
>
> 1. arangoimp wasn't able to process a large JSON file and would get
> stuck. (The JSON file was identical to the jsonl file linked below, except
> that here we had an array of JSON objects on a single line.)
>
> $ docker exec -i 9ecdb1b73004 arangoimp --server.password XXXXX \
>     --file "dxids.json" --type json --collection events \
>     --progress true --threads 4 --on-duplicate ignore
>
> Connected to ArangoDB 'http+tcp://127.0.0.1:8529', version 3.4.5,
> database: '_system', username: 'root'
>
> ----------------------------------------
> database:         _system
> collection:       events
> create:           no
> create database:  no
> source filename:  dxids.json
> file type:        json
> threads:          4
> connect timeout:  5
> request timeout:  1200
> ----------------------------------------
>
> Starting JSON import...
>
> 2019-06-05T09:05:40Z [321] ERROR error message(s):
> 2019-06-05T09:05:40Z [321] ERROR import file is too big. please increase
> the value of --batch-size (currently 1048576)
>
> We kept getting an error telling us to increase the batch size. As we kept
> increasing --batch-size it started to process, but it would eventually get
> stuck at 99% (we left it running for 2-3 hours) without success.
> E.g.:
>
> 2019-06-05T09:06:36Z [375] INFO processed 34634719 bytes (93%) of input file
> 2019-06-05T09:06:36Z [375] INFO processed 35748797 bytes (96%) of input file
> 2019-06-05T09:06:36Z [375] INFO processed 36895642 bytes (99%) of input file
>
> 2. We changed the file to the jsonl representation. This time around it at
> least processes, but it takes close to 50-70 minutes to finish.
>
> Here are some stats on our data:
>
> $ wc -l dxids-cleaned.json
> 775783 dxids-cleaned.json
>
> $ head dxids-cleaned.json
> {"_key":"ca7c1b92-962f-482b-8be1-d3888686aee9"}
> {"_key":"a54432a0-15c8-46d2-8f67-21c928c385cf"}
> {"_key":"c6aa3a49-0d56-4c31-b0f5-32ca88725fff"}
> {"_key":"19a207fc-7fcb-4dee-9789-146d5fc7ed0a"}
> {"_key":"08e9b852-c4fd-4ff1-83bb-9aaf6e7f837f"}
> {"_key":"d6e88e54-cf1f-4566-9ffd-e43aeb3b6767"}
> {"_key":"717a99d2-1985-4af1-ab09-4210324c1c83"}
> {"_key":"a6377fc2-11bc-4d3c-9c54-ae4f12e7b439"}
> {"_key":"a6249b90-a055-4f36-94c7-b16765c8d654"}
> {"_key":"2261b38b-a75e-4e6d-b50e-9715a52c6e33"}
>
> Source: https://testfkej2fb945.s3.amazonaws.com/dxids-cleaned.json.zip
>
> Here is how we are initiating the import:
>
> $ docker exec -i f295a1638892 arangoimp --server.password XXXXX \
>     --file "dxids-cleaned.json" --type jsonl --collection events \
>     --progress true --threads 4 --on-duplicate ignore
>
> Finally, I wanted to understand:
>
> 1. Can we tweak our approach?
> 2. Is the 40-60 minutes it takes to process within the expected range? We
> bulk ingested into Neo4j and it took a few minutes. I'm simply curious, as
> we are doing this evaluation for our internal use case.
>
> Best,
>
>