[arangodb-google] Re: Loop through whole collection

Cyprien Gottstein Fri, 21 Dec 2018 01:18:53 -0800

Well, I don't work for OrientDB and I don't know every detail of their 
storage engine, but according to their documentation:


Every time you create a class (a collection in ArangoDB), OrientDB will 
assign a number of clusters to store that particular class. This number 
is generally based on the number of available CPUs on the machine running 
OrientDB (they say it's to enable better multithreading; I don't know what 
they do on their side, but I could definitely do some multithreading of my 
own thanks to the RIDs).

Each time you insert a new record into the class, it is distributed in a 
round-robin fashion across the clusters. Also, when you delete a record, its 
RID is not reused, and there is no tool to defragment RIDs.
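
To make the two properties above concrete, here is a minimal Python sketch 
of a toy store. It is not OrientDB code: the ToyClass name, the cluster ids 
10 and 11, and the per-cluster counter are assumptions chosen only to 
illustrate round-robin placement and RIDs that are never handed out twice.

```python
from itertools import cycle


class ToyClass:
    """Toy model of a class spread over several clusters (not OrientDB code)."""

    def __init__(self, cluster_ids):
        self._next_cluster = cycle(cluster_ids)            # round-robin selector
        self._next_position = {c: 0 for c in cluster_ids}  # per-cluster counter, only grows
        self.records = {}                                  # (cluster, position) -> payload

    def insert(self, payload):
        cluster = next(self._next_cluster)
        position = self._next_position[cluster]
        self._next_position[cluster] += 1
        rid = (cluster, position)
        self.records[rid] = payload
        return rid

    def delete(self, rid):
        # The counter is not rolled back, so this RID is never assigned again.
        del self.records[rid]


def fmt(rid):
    """Render a (cluster, position) pair in the #cluster:position notation."""
    return f"#{rid[0]}:{rid[1]}"


toy = ToyClass(cluster_ids=[10, 11])
rids = [toy.insert(f"doc{i}") for i in range(4)]
toy.delete(rids[0])
new_rid = toy.insert("doc4")
print([fmt(r) for r in rids], fmt(new_rid))
```

In this toy model the record inserted after the deletion gets a fresh 
position in its cluster instead of taking over the deleted one.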

One could argue that this leads to bad disk usage, but a RID doesn't map 
directly to a physical address; I believe they use a mapping between the 
RID and the physical position. That way, the disk space is reused but the 
RIDs stay "append-only", meaning that once I have reached RID #10:100, 
whatever happens in terms of deletions/insertions, I know that I never need 
to iterate over the records below #10:100 again.

We have a special process in which we need to iterate through the whole 
collection to update pretty much ALL of the records.
This process is not transactional as a whole, since we run many small 
transactions. There is no "resume" on the OrientDB side; what we resume is 
our own process looping through the whole collection. We keep a RID cursor 
in memory to know how far we are into the collection; on every iteration we 
launch a new query with the appropriate RID range, and in the end we have 
looped through the whole collection.
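
The loop described above could be sketched roughly as follows, in Python 
against an in-memory dict rather than a real database. Everything here is 
hypothetical: fetch_range stands in for the range query, the checkpoint dict 
stands in for the saved RID cursor, and upper_bound assumes the highest 
position in the cluster is known in advance (needed because deleted records 
leave gaps, so an empty batch does not mean the end was reached).

```python
def fetch_range(store, cluster, start, end):
    """Stand-in for a RID range query: records with start <= position < end."""
    return [(pos, store[(cluster, pos)])
            for pos in range(start, end)
            if (cluster, pos) in store]


def run(store, cluster, upper_bound, checkpoint, batch_size=10, max_batches=None):
    """Update every record in batches; return True once the whole range is done."""
    done = 0
    while checkpoint["next"] < upper_bound:
        if max_batches is not None and done >= max_batches:
            break  # simulate an interruption (crash, lost connection, ...)
        start = checkpoint["next"]
        end = min(start + batch_size, upper_bound)
        # One small unit of work per batch, standing in for a small transaction.
        for pos, value in fetch_range(store, cluster, start, end):
            store[(cluster, pos)] = value.upper()
        checkpoint["next"] = end  # advance the cursor only after the batch is done
        done += 1
    return checkpoint["next"] >= upper_bound


store = {(10, i): f"doc{i}" for i in range(25)}
del store[(10, 7)]                      # a gap left by a deleted record
checkpoint = {"next": 0}

first_pass = run(store, 10, 25, checkpoint, max_batches=1)   # interrupted early
position_after_interrupt = checkpoint["next"]
second_pass = run(store, 10, 25, checkpoint)                 # resumes from the cursor
```

Because the cursor only moves forward after a batch completes, an interrupted 
run repeats at most one batch when it resumes, so the per-record update should 
be safe to apply twice.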

We could use intermediate commits in ArangoDB, which would be just about 
the same thing we are doing, apart from memory consumption. In OrientDB, with 
RIDs, you select only the block of data you want, corresponding to the RID 
range you gave, and that's it; nothing else has to be loaded.

As an example, it would look like this:

SELECT * from MyClass WHERE @rid >= #10:0 AND @rid < #10:10
SELECT * from MyClass WHERE @rid >= #10:10 AND @rid < #10:20
SELECT * from MyClass WHERE @rid >= #10:20 AND @rid < #10:30
SELECT * from MyClass WHERE @rid >= #10:30 AND @rid < #10:40
SELECT * from MyClass WHERE @rid >= #10:40 AND @rid < #10:50
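
A small Python helper can generate this sequence of queries. This is only a 
sketch: rid_range_queries and its parameters are hypothetical names, it only 
builds the query strings (it does not talk to OrientDB), and it assumes the 
class name and cluster id come from trusted code rather than user input.

```python
def rid_range_queries(class_name, cluster_id, upper_bound, batch_size):
    """Yield one range query per batch, covering positions [0, upper_bound)."""
    for start in range(0, upper_bound, batch_size):
        end = min(start + batch_size, upper_bound)
        yield (f"SELECT * from {class_name} "
               f"WHERE @rid >= #{cluster_id}:{start} AND @rid < #{cluster_id}:{end}")


queries = list(rid_range_queries("MyClass", 10, 50, 10))
for query in queries:
    print(query)
```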

It may sound strange, but it scales really well: I ran it on a collection 
with more than 10 million records and there was no slowdown.

On Friday, 21 December 2018 01:18:01 UTC+1, Simran Brucherseifer wrote:
>
> ... but in one single continuous timelapse. With the appropriate 
>> pagination (the infamous RIDs) we could pause our process and resume it 
>> later on (for whatever reason, crash of the system, connectivity problem 
>> and such).
>>
>
> Cursors have a TTL, but in the event of a server crash they would be gone 
> regardless of the driver as far as I know.
> I'm curious how RIDs are implemented; is it possible to resume any kind of 
> query or only queries which basically return documents as-is?
>  
>
> ArangoExport could also be an option, we would have to change a bit our 
>> approach and build some kind of workaround since it's not available from 
>> the java driver but this seems feasible
>>
>
> There is some work on the way to exploit the sortedness of the primary 
> index using the RocksDB storage engine:
> https://github.com/arangodb/arangodb/pull/7788
> That in combination with a streaming cursor and a simple FOR doc IN coll 
> RETURN doc query, it might be a possible alternative to arangoexport.
>
