Hi,

the collection datafiles written by ArangoDB are not indexed. The hash 
indexes that can be created on data are in-memory indexes. The index 
description is stored on disk, but the actual index is in memory. Thus the 
only way for an external process to find a document in the datafiles of a 
collection is to scan them all until the sought document is found. I don't 
recommend doing this.
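
For what it is worth, the index definitions (not the index data) can be 
inspected through the HTTP API. A small Python sketch, assuming the standard 
GET /_api/index endpoint; server address, credentials and collection name are 
placeholders:

    # List the index definitions of a collection. Only the definitions are
    # returned; the hash index contents themselves live in server memory.
    import requests

    ARANGO_URL = "http://localhost:8529"   # placeholder server address
    COLLECTION = "vertices"                # placeholder collection name

    resp = requests.get(ARANGO_URL + "/_api/index",
                        params={"collection": COLLECTION},
                        auth=("root", ""))  # placeholder credentials
    resp.raise_for_status()
    for idx in resp.json()["indexes"]:
        print(idx["type"], idx.get("fields"))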

Is there a way for the external processes to use an existing ArangoDB API 
instead? That would be quicker and also safer, because ArangoDB can then 
handle any locking and won't free any resources the external process still 
requires. There may be suitable APIs that can fetch several documents per 
request, so that the HTTP overhead is reduced to an acceptable level (a rough 
sketch follows below). 
Can you let us know what kind of search operations you actually want to run 
on the graph data?
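
One possible shape for such a batched lookup is the AQL cursor API (POST 
/_api/cursor). The following is only a rough Python sketch; the server 
address, credentials, collection name and keys are placeholders you would 
need to replace:

    # Fetch many documents in one HTTP round-trip via an AQL query,
    # instead of issuing one GET per document.
    import requests

    ARANGO_URL = "http://localhost:8529"   # placeholder server address
    AUTH = ("root", "")                    # placeholder credentials

    payload = {
        "query": "FOR d IN @@col FILTER d._key IN @keys RETURN d",
        "bindVars": {"@col": "vertices",                 # placeholder collection
                     "keys": ["key1", "key2", "key3"]},  # placeholder keys
        "batchSize": 1000,                 # documents returned per batch
    }
    resp = requests.post(ARANGO_URL + "/_api/cursor", json=payload, auth=AUTH)
    resp.raise_for_status()
    body = resp.json()
    documents = body["result"]

    # Follow the cursor if the result did not fit into a single batch.
    while body.get("hasMore"):
        body = requests.put(ARANGO_URL + "/_api/cursor/" + body["id"],
                            auth=AUTH).json()
        documents.extend(body["result"])

    print(len(documents), "documents fetched")

A single request like this replaces many per-document requests, which is 
where most of the HTTP overhead comes from.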

Best regards
Jan

On Friday, February 24, 2017 at 3:10:14 PM UTC+1, Alexandre Rostovtsev wrote:
>
> Thanks a lot for the explanation.
>
> On Friday, February 24, 2017 at 5:50:12 AM UTC-5, Jan wrote:
>>
>> Hi,
>>
>> in general that would be possible, because another process could simply 
>> open a collection's memory-mapped datafiles and read them.
>>
>> Still there are a few issues with this approach:
>> - all data is written into the write-ahead log (WAL) first, and then 
>> eventually transferred to the datafiles of a collection. If an external 
>> reader wants to process all data of a collection by reading the 
>> collection's datafiles, it must trigger a WAL flush first and wait until 
>> all data has been transferred from the WAL to the datafiles.
>> - the server may compact existing datafiles of collections at any time. 
>> Datafiles may become obsolete because of the compaction, and will get 
>> deleted eventually. ArangoDB is not aware of external processes reading its 
>> datafiles, so an external process may crash when reading a datafile that 
>> ArangoDB is currently unmapping or physically deleting. A simple fix for 
>> this is to turn off the compaction for a collection, but that will lead to 
>> ever-growing datafiles for this collection. This may not be a problem if 
>> there are only a few update/remove operations on this collection.
>> - the procedure relies on the datafiles being arranged in a certain way, 
>> with a certain storage format. The storage format of ArangoDB may change in 
>> future versions and external processes that read ArangoDB's datafiles may 
>> need to be adjusted then.
>>
>
> The usage pattern is frequent reads, rare updates, so waiting for WAL 
> flush and compaction to finish is no problem. And of course we realize we 
> will be relying on a particular Arango version's storage format.
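
As a side note, that flush can be triggered explicitly before reading the 
datafiles. A rough sketch, assuming the /_admin/wal/flush endpoint (parameter 
handling may differ between versions; server address and credentials are 
placeholders):

    # Ask the server to flush the WAL and wait until the collector has
    # moved the data into the collection datafiles.
    import requests

    requests.put("http://localhost:8529/_admin/wal/flush",
                 params={"waitForSync": "true", "waitForCollector": "true"},
                 auth=("root", "")).raise_for_status()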
>  
>
>> - the datafiles written by ArangoDB are append-only and are not indexed. 
>> Using datafiles to quickly locate documents or connected documents is not 
>> ideal without maintaining a separate index in the external process (which 
>> effectively requires reading the collection datafiles completely once to 
>> build up the index first).
>>
>
> I was hoping that Arango's own hash indexes were also stored on disk, and 
> that we could make use of them. Or is that not possible?
>  
>
>> As there are several disadvantages with this approach, I suggest looking 
>> for an alternative. Is it possible to run the graph traversals/document 
>> lookups in JavaScript inside the server, and expose that over a REST API? 
>> That would minimize the number of HTTP requests and does not require any 
>> modifications to the server code. 
>>
>
> Unfortunately, the traversal filters will make use of extremely large 
> non-Arango data files that live on other nodes in the cluster. Keeping the 
> graph database and the other raw data together on one node would be 
> difficult.
>
> Best regards
>> Jan
>>
>
