Hi Dave,
Thanks for looking into this. You are right - sorting the input gets it
processed much faster!
It would be nice to understand the internal trade-off behind this, or to see
any documentation that covers it - please let me know if you can share some
insights.
On Wednesday, June 5, 2019 at 3:38:59 PM UTC+5:30, Dave Challis wrote:
>
> I had a quick look (since we're also evaluating ArangoDB at the moment).
>
> It looks like your import is slow due to the large number of duplicate IDs
> in your dataset (there are only 50000 unique IDs among the 775783 lines in
> the file).
>
> Filtering out the duplicates before importing would definitely help (see the
> sketch after the timings below); otherwise, sorting the input also helps.
> I ran this locally with 2 threads:
>
> $ sort dxids-cleaned.json > sorted-dxids-cleaned.json
> $ time arangoimp --file sorted-dxids-cleaned.json --type jsonl
> --collection dxid --progress true --server.authentication false
> --create-collection true --overwrite true --on-duplicate ignore
>
> created: 50000
> warnings/errors: 394
> updated/replaced: 0
> ignored: 725389
> real 1m 52.74s
> user 0m 0.12s
> sys 0m 0.19s
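>
> If you want to drop the duplicates entirely before importing, a minimal
> sketch (file names are just examples; this assumes duplicate documents are
> byte-identical lines, which they are in a single-field jsonl file like
> yours):
>
> # keep only the first occurrence of each line, preserving input order
> $ awk '!seen[$0]++' dxids-cleaned.json > deduped-dxids.json
>
> # or sort and deduplicate in one pass (also gives you the sorted-input win)
> $ sort -u dxids-cleaned.json > sorted-deduped-dxids.json
>
> Either output can then be fed to arangoimp as before.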
>
> On Wednesday, 5 June 2019 10:31:13 UTC+1, Akshay Surve wrote:
>>
>> Hi,
>>
>> We were evaluating ArangoDB for our use case and it looked very promising
>> until we hit some blockers, so I wanted to share them with the community
>> and see if we could possibly change anything about our approach.
>>
>> System:
>> - Running ArangoDB 3.4.5 as a Docker instance
>> - Using the RocksDB storage engine
>> - macOS, 16 GB RAM
>>
>> Use case:
>> ~775K user GUIDs which we wanted to bulk import. You can see some sample
>> values below.
>>
>> We stumbled upon 2 blockers:
>>
>> 1. arangoimp wasn't able to process a large JSON file and would get stuck.
>> (The JSON file was identical to the jsonl file linked below, except that
>> here we had an array of JSON objects on a single line.)
>>
>> $ docker exec -i 9ecdb1b73004 arangoimp --server.password XXXXX --file
>> "dxids.json" --type json --collection events --progress true --threads 4
>> --on-duplicate ignore
>>
>> Connected to ArangoDB 'http+tcp://127.0.0.1:8529', version 3.4.5,
>> database: '_system', username: 'root'
>>
>> ----------------------------------------
>>
>> database: _system
>>
>> collection: events
>>
>> create: no
>>
>> create database: no
>>
>> source filename: dxids.json
>>
>> file type: json
>>
>> threads: 4
>>
>> connect timeout: 5
>>
>> request timeout: 1200
>>
>> ----------------------------------------
>>
>> Starting JSON import...
>>
>>
>> 2019-06-05T09:05:40Z [321] ERROR error message(s):
>>
>> 2019-06-05T09:05:40Z [321] ERROR import file is too big. please increase
>> the value of --batch-size (currently 1048576)
>>
>> We kept getting an error asking us to increase the batch size. Once we
>> increased --batch-size, the import started to process but would eventually
>> get stuck at 99% (we kept it running for 2-3 hours) without finishing; an
>> example invocation with a larger batch size is sketched after the log
>> excerpt below. E.g.:
>>
>> 2019-06-05T09:06:36Z [375] INFO processed 34634719 bytes (93%) of input
>> file
>>
>> 2019-06-05T09:06:36Z [375] INFO processed 35748797 bytes (96%) of input
>> file
>>
>> 2019-06-05T09:06:36Z [375] INFO processed 36895642 bytes (99%) of input
>> file
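>>
>> For reference, a larger batch size can be passed like this (just a sketch;
>> the 64 MB value here is an example, not the exact size we settled on):
>>
>> # same single-file JSON import, but with a 64 MB batch buffer
>> $ docker exec -i 9ecdb1b73004 arangoimp --server.password XXXXX --file
>> "dxids.json" --type json --collection events --progress true --threads 4
>> --on-duplicate ignore --batch-size 67108864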
>>
>>
>> 2. We changed the file to a jsonl representation (one document per line; a
>> sketch of the conversion is below). This time around it at least processes,
>> but it takes close to 50-70 minutes to finish.
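>>
>> A minimal sketch of that conversion, assuming jq is available and the
>> original file holds one top-level JSON array (file names are just the ones
>> we used):
>>
>> # write each array element as one compact JSON document per line (jsonl)
>> $ jq -c '.[]' dxids.json > dxids-cleaned.json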
>>
>> Here are some stats on our data:
>>
>> $ wc -l dxids-cleaned.json
>>
>> 775783 dxids-cleaned.json
>>
>> $ head dxids-cleaned.json
>>
>> {"_key":"ca7c1b92-962f-482b-8be1-d3888686aee9"}
>>
>> {"_key":"a54432a0-15c8-46d2-8f67-21c928c385cf"}
>>
>> {"_key":"c6aa3a49-0d56-4c31-b0f5-32ca88725fff"}
>>
>> {"_key":"19a207fc-7fcb-4dee-9789-146d5fc7ed0a"}
>>
>> {"_key":"08e9b852-c4fd-4ff1-83bb-9aaf6e7f837f"}
>>
>> {"_key":"d6e88e54-cf1f-4566-9ffd-e43aeb3b6767"}
>>
>> {"_key":"717a99d2-1985-4af1-ab09-4210324c1c83"}
>>
>> {"_key":"a6377fc2-11bc-4d3c-9c54-ae4f12e7b439"}
>>
>> {"_key":"a6249b90-a055-4f36-94c7-b16765c8d654"}
>>
>> {"_key":"2261b38b-a75e-4e6d-b50e-9715a52c6e33"}
>>
>> Source: https://testfkej2fb945.s3.amazonaws.com/dxids-cleaned.json.zip
>>
>> Here is how we are initiating the import:
>>
>> $ docker exec -i f295a1638892 arangoimp --server.password XXXXX --file
>> "dxids-cleaned.json" --type jsonl --collection events --progress true
>> --threads 4 --on-duplicate ignore
>>
>> Finally, I wanted to understand:
>>
>> 1. Can we tweak our approach?
>> 2. Is the 40-60 minutes the import takes within the expected range? We bulk
>> ingested into Neo4j and it took a few minutes. I'm simply curious, as we are
>> doing this evaluation for our internal use case.
>>
>> Best,
>>
>>