[arangodb-google] Re: Parallelizing JSON import

Jan Tue, 28 May 2019 05:22:50 -0700

Hi,

it depends a bit on how big the documents are.
For smaller documents it will make sense to insert/import data with 
multiple parallel client threads.


If the documents are "big" and writing them to the storage engine becomes 
the bottleneck, then parallelizing the insert/import will not help much.

You may try out how much parallelization will help you by importing data in 
parallel using the bundled arangoimport binary.
arangoimport provides an option `--threads`, which defaults to 2. You can 
vary this value from 1 up to whatever upper bound seems sensible to you and 
check whether it makes any difference to the runtime of the import process.
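
A minimal sketch of such a timing run, in Python since the original import 
script uses pyArango. The file name, collection name and thread counts are 
placeholders, connection and authentication options are left out, and 
`--overwrite true` empties the collection before each run, so only point 
this at a test collection:

```python
import subprocess
import time


def arangoimport_command(jsonl_path, collection, threads):
    """Build the argument list for one arangoimport run."""
    return [
        "arangoimport",
        "--file", jsonl_path,
        "--type", "jsonl",
        "--collection", collection,
        "--create-collection", "true",
        # WARNING: removes existing data in the collection, so that every
        # run starts from the same empty state.
        "--overwrite", "true",
        "--threads", str(threads),
    ]


def time_import(jsonl_path, collection, threads, run=subprocess.run):
    """Run one import and return the elapsed wall-clock time in seconds."""
    started = time.perf_counter()
    run(arangoimport_command(jsonl_path, collection, threads), check=True)
    return time.perf_counter() - started
```

Calling `time_import("sample.jsonl", "import_test", n)` for n in 1, 2, 4, 8 
and comparing the returned times should show where adding threads stops 
paying off.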

Apart from this, it will very likely make sense to insert documents in 
parallel if the single-document APIs are used. This is because the actual 
insertion time will only be a small fraction of each request, and a great 
deal of time will be spent on processing requests, putting together 
responses and waiting for the network. Here parallelization should help a 
lot.
It may be different if you are already sending multiple documents to the 
server in a single batch, e.g. using the import API at POST /_api/import, 
or by sending an array of documents to POST /_api/document. Here the server 
may already be quite busy, but maybe parallelization can still help at 
least to some extent here.
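
As a rough sketch of combining both ideas, the following sends arrays of 
documents from several client threads. It uses only the Python standard 
library rather than pyArango; the URL (with the collection name appended to 
/_api/document), the batch size and the worker count are assumptions to 
adapt, and authentication and per-document error handling are omitted:

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def post_batch(batch, url):
    """Send one array of documents in a single POST request."""
    request = urllib.request.Request(
        url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def insert_parallel(documents, send, batch_size=100, workers=4):
    """Pass each batch to `send` from a thread pool; results keep batch order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, chunked(documents, batch_size)))
```

A call could look like 
`insert_parallel(docs, lambda b: post_batch(b, "http://127.0.0.1:8529/_api/document/import_test"))`. 
Threads are enough here because the client mostly waits for the network and 
the server.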

I suggest trying with arangoimport first to assess the potential benefits 
(if any).
If you are using it, please use the import format that has a single JSON 
document per line (jsonl).
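
A directory of individual JSON files can be turned into that format along 
these lines. This is only a sketch: it assumes every file holds exactly one 
JSON document, and the function and path names are made up for illustration:

```python
import json
from pathlib import Path


def write_jsonl(source_dir, target_path):
    """Write each *.json file in source_dir as one line of target_path."""
    count = 0
    with open(target_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(source_dir).glob("*.json")):
            with open(path, encoding="utf-8") as src:
                document = json.load(src)
            # json.dumps without `indent` escapes newlines inside strings,
            # so every document ends up on exactly one line.
            out.write(json.dumps(document) + "\n")
            count += 1
    return count
```

The function returns the number of documents written, which can be compared 
against the number of documents arangoimport reports as created.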

Best regards
Jan



On Tuesday, 28 May 2019 14:02:33 UTC+2, Andreas Jung wrote:
>
> We are currently using ArangoDB as a migration database (100,000 JSON 
> files, 50 GB data, about 25% of the JSON files contain base64 encoded 
> images, PDF files etc.).
> I wrote a custom import script for the data that takes about 90 minutes 
> for the import using pyArango - one JSON file at a time...working nicely so 
> far.
> Question: would it make sense to parallelize the import in order to speed up 
> the import process? Or is the performance of ArangoDB CPU/IO bound for such 
> mass imports?
> We are running a standard standalone installation of ArangoDB 3.4.5 on a 
> local SSD...no fancy setup.
>
> Andreas
>

