Thanks a lot for the explanation.

On Friday, February 24, 2017 at 5:50:12 AM UTC-5, Jan wrote:
>
> Hi,
>
> in general that would be possible, because another process could simply 
> open a collection's memory-mapped datafiles and read them.
>
> Still there are a few issues with this approach:
> - all data is written into the write-ahead log (WAL) first, and then 
> eventually transferred to the datafiles of a collection. If an external 
> reader wants to process all data of a collection by reading the 
> collection's datafiles, it must trigger a WAL flush first and wait until 
> all data has been transferred from the WAL to the datafiles.
> - the server may compact existing datafiles of collections at any time. 
> Datafiles may become obsolete because of the compaction, and will get 
> deleted eventually. ArangoDB is not aware of external processes reading its 
> datafiles, so an external process may crash when reading a datafile that 
> ArangoDB is currently unmapping or physically deleting. A simple fix for 
> this is to turn off the compaction for a collection, but that will lead to 
> ever-growing datafiles for this collection. This may not be a problem if 
> there are only a few update/remove operations on this collection.
> - the procedure relies on the datafiles being arranged in a certain way, 
> with a certain storage format. The storage format of ArangoDB may change in 
> future versions and external processes that read ArangoDB's datafiles may 
> need to be adjusted then.
>

The usage pattern is frequent reads and rare updates, so waiting for a WAL 
flush and for compaction to finish is not a problem. And of course we realize 
we will be relying on a particular ArangoDB version's storage format.
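
For what it's worth, this is roughly what we had in mind for preparing a 
collection before each external read pass. It is only a sketch against the 
stock REST API; the host/port, the missing authentication and the collection 
name 'mycollection' are placeholders, and please correct me if the WAL flush 
endpoint or the doCompact property do not behave the way I assume:

'use strict';
// Rough sketch (Node.js): prepare a collection before an external process
// reads its datafiles directly. Assumes ArangoDB on localhost:8529 without
// authentication; 'mycollection' is a placeholder name.
const http = require('http');

function call(method, path, body) {
  return new Promise(function (resolve, reject) {
    const payload = body ? JSON.stringify(body) : null;
    const headers = payload
      ? { 'Content-Type': 'application/json',
          'Content-Length': Buffer.byteLength(payload) }
      : {};
    const req = http.request(
      { host: 'localhost', port: 8529, method: method, path: path,
        headers: headers },
      function (res) {
        let data = '';
        res.on('data', function (chunk) { data += chunk; });
        res.on('end', function () {
          resolve({ status: res.statusCode, body: data });
        });
      });
    req.on('error', reject);
    if (payload) { req.write(payload); }
    req.end();
  });
}

// 1. Flush the write-ahead log and wait for the collector, so everything
//    has been moved into the collection's datafiles before we read them.
// 2. Turn off compaction for the collection, so the server does not rewrite
//    or delete datafiles while the external reader has them open.
call('PUT', '/_admin/wal/flush?waitForSync=true&waitForCollector=true')
  .then(function () {
    return call('PUT', '/_api/collection/mycollection/properties',
                { doCompact: false });
  })
  .then(function () {
    console.log('collection should now be safe to read externally');
  });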
 

> - the datafiles written by ArangoDB are append-only and are not indexed. 
> Using datafiles to quickly locate documents or connected documents is not 
> ideal without maintaining a separate index in the external process (which 
> effectively requires reading the collection datafiles completely once to 
> build up the index first).
>

I was hoping that ArangoDB's own hash indexes are also stored on disk, so 
that we could make use of them. Or is that not possible?
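
In case they cannot be reused, this is the kind of separate index we would 
expect to maintain ourselves. Again only a sketch: parseDatafile() is a 
made-up placeholder for whatever decodes the datafile format of the ArangoDB 
version we pin ourselves to, and the 'datafile-' naming is an assumption on 
my part:

'use strict';
// Sketch of the "separate index" mentioned above: scan the collection's
// datafiles once and keep an in-memory map from _key to the most recent
// document. parseDatafile() is a hypothetical placeholder; it stands for
// whatever decodes the on-disk format of the ArangoDB version we target.
const fs = require('fs');
const path = require('path');

function* parseDatafile(file) {
  // not implemented here: open/parse 'file' and
  // yield { _key: '...', ... } for every surviving document
}

function buildKeyIndex(datafileDir) {
  const index = new Map();
  for (const name of fs.readdirSync(datafileDir)) {
    // the 'datafile-' naming is an assumption on my part
    if (name.indexOf('datafile-') !== 0) { continue; }
    for (const doc of parseDatafile(path.join(datafileDir, name))) {
      index.set(doc._key, doc);  // later datafiles win for the same key
    }
  }
  return index;
}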
 

> As there are several disadvantages with this approach, I suggest looking 
> for an alternative. Is it possible to run the graph traversals/document 
> lookups in JavaScript inside the server, and expose that over a REST API? 
> That would minimize the number of HTTP requests and would not require any 
> modifications to the server code.
>
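
Just to make sure I understand the suggestion correctly: that would be a 
small Foxx service along these lines, right? (Only a sketch; the route, the 
graph name 'myGraph' and the traversal depth are made up.)

'use strict';
// Sketch of the server-side approach: run the traversal inside the server
// and expose it over REST as a Foxx service route. The graph name 'myGraph'
// and the 1..2 depth are placeholders.
const createRouter = require('@arangodb/foxx/router');
const arangodb = require('@arangodb');
const db = arangodb.db;
const aql = arangodb.aql;

const router = createRouter();
module.context.use(router);

// GET /neighbors?start=vertices/123
// One HTTP request from the client; the traversal runs next to the data.
router.get('/neighbors', function (req, res) {
  const start = req.queryParams.start;
  const result = db._query(aql`
    FOR v IN 1..2 OUTBOUND ${start} GRAPH 'myGraph'
      RETURN v
  `).toArray();
  res.json(result);
});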

Unfortunately, the traversal filters will make use of extremely large 
non-Arango data files that live on other nodes in the cluster. Keeping the 
graph database and the other raw data together on one node would be 
difficult.

> Best regards
> Jan
>
