Hi,

It would be possible to implement it, but as I said it would require
a lot of work. The hard part is merging the bulk-loaded data set in a
safe *transactional* way.

The problem I see with it is that the data you are bulk loading has
to be independent of the rest of the graph. This is rarely the case
(most likely you will have connections to already existing nodes), so I
do not see a common use case for this. Another thing: if you really do
have a separate graph, you can just bulk load it into a separate db,
then have multiple NeoService instances in your application and
emulate a link between them using a property.
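
A minimal sketch of that two-instance setup (the store paths and the
"other_node_id"/"other_store" property keys are just invented
conventions for this example; only NeoService, EmbeddedNeo and the
transaction calls are the actual API):

    import org.neo4j.api.core.EmbeddedNeo;
    import org.neo4j.api.core.NeoService;
    import org.neo4j.api.core.Node;
    import org.neo4j.api.core.Transaction;

    public class CrossStoreLink
    {
        public static void main( String[] args )
        {
            // Two completely separate stores, each with its own NeoService
            NeoService mainNeo = new EmbeddedNeo( "var/main-db" );
            NeoService bulkNeo = new EmbeddedNeo( "var/bulk-db" );
            Transaction mainTx = mainNeo.beginTx();
            Transaction bulkTx = bulkNeo.beginTx();
            try
            {
                Node mainNode = mainNeo.createNode();
                Node bulkNode = bulkNeo.createNode();
                // Emulate a relationship across the two stores by recording
                // the target node's id and which store it lives in.
                mainNode.setProperty( "other_node_id", bulkNode.getId() );
                mainNode.setProperty( "other_store", "var/bulk-db" );
                mainTx.success();
                bulkTx.success();
            }
            finally
            {
                mainTx.finish();
                bulkTx.finish();
            }
            mainNeo.shutdown();
            bulkNeo.shutdown();
        }
    }

Note that the two transactions here are completely independent, which
is exactly the cross-store consistency problem a real merge would have
to solve transactionally.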

Another possible way for you to solve this is to put the application
in read-only mode while you bulk load the data (using the same db).
Shut down the normal NeoService, create a read-only one for the
application to use during the bulk load, then switch back to a
read/write NeoService once the loading is done. The problem with that
approach is that if the application crashes in the middle of a batch
insert your db may be corrupt.
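
For what it is worth, that switch could be structured roughly like the
outline below, assuming the read-only embedded class is
EmbeddedReadOnlyNeo and with bulkLoad() as a placeholder for whatever
batch insert you actually run against the same store directory:

    import org.neo4j.api.core.EmbeddedNeo;
    import org.neo4j.api.core.EmbeddedReadOnlyNeo;
    import org.neo4j.api.core.NeoService;

    public class BulkLoadSwitch
    {
        private static final String STORE_DIR = "var/main-db";
        private NeoService neo = new EmbeddedNeo( STORE_DIR );

        public synchronized void runBulkLoad()
        {
            // Release the read/write instance and let the application keep
            // serving reads from a read-only instance on the same store.
            neo.shutdown();
            neo = new EmbeddedReadOnlyNeo( STORE_DIR );
            try
            {
                bulkLoad( STORE_DIR );  // placeholder for the batch insert
            }
            finally
            {
                // Switch back to normal read/write mode once loading is done
                neo.shutdown();
                neo = new EmbeddedNeo( STORE_DIR );
            }
        }

        private void bulkLoad( String storeDir )
        {
            // ... batch insertion against the same store directory ...
        }
    }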

Regards,
-Johan

On Thu, Oct 29, 2009 at 10:51 PM, Craig Taverner <cr...@amanzi.com> wrote:
> Hi,
>
> Recently Johan pointed out to me that merging databases was not something
> easily added to the core neo4j because there would be clashes of ids.
> However, I have thought a bit more about it, and I think there is a way this
> can work (in theory) for special cases.
>
> Here is my example:
>
>   - We have a main database that contains most data. Node ids presumably
>   increment from 1 and up as data is added.
>   - We wish to bulk load data into this database while it is still active
>   (not allowed by the bulk loader)
>   - We instead bulk load into a completely new database, but one where the
>   ids are set to start at a high number, well above the current max id of
>   the main database
>   - Once the bulk load is finished, the new database is appended to the
>   main one with no id clashes
>
> Issues I can imagine with this approach:
>
>   - We need to be sure the main database does not grow enough to create an
>   id clash during the time of the bulk load. I would think the offset is
>   something the application programmer needs to choose based on knowledge of
>   the behaviour of the application in this regard. For example, if the only
>   way the application adds large numbers of nodes is through the bulk load,
>   the programmer knows the main ids will not grow as long as only one bulk
>   load runs at a time. Worst case scenario, the merge is rejected and the
>   application has to repeat the entire load with a new offset.
>   - If ids are actually pure array indexes into the data, then the current
>   neo4j code would not support indexing from a high number. I would imagine
>   it would be easy to have a database-wide offset to deal with this.
>   - If ids are array indexes, after the merge there would be a possibly
>   large chunk of unused space in the database (after the end of the main
>   data and before the beginning of the new data).
>   - This trick is probably necessary for all database files: nodes,
>   relationships and properties.
>   - After the merge the application code needs all Node instances to still
>   be valid, for both databases, but presumably the two NeoService instances
>   need to be merged so that they point to one object. The application code
>   should also take care to create appropriate links between old and new
>   data, assuming that is needed by the application (it is in my case).
>
> What do people think of this approach? If it is possible, it would certainly
> solve my problem with needing to run bulk loads on a live database. I
> personally do not think the additional API components would be too complex.
> In fact all that is needed are three API additions:
>
>   - Ability to get the current max node id from the main database (perhaps
>   already exists?)
>   - Ability to set the offset for the ids in the EmbeddedNeo() constructor
>   - Method on NeoService (or probably only on EmbeddedNeo) for merging in
>   another database, for example, mainNeo.append(tempNeo). This would
>   presumably return a boolean, or throw exceptions. I expect the tempNeo
>   instance becomes invalid after this call (or points to the same database as
>   mainNeo). Node instances on either database need to remain valid (but point
>   to the main database), so we can add relations immediately to link the
>   datasets correctly.
>
> Cheers, Craig