Hi,

My suggestion is to _not_ support "resumable" operations on a large tree,
but instead don't use large operations. But I wouldn't call my solution
"sharding", but more "bit-by-bit reindexing". Some more details: For
indexing (specially synchronous property indexes) I suggest to do the
following, for both a new index, and for reindexing:

1) Reindexing is writing to a new subtree, that is, ":index_1",
":index_2",... Which one is used is (automatically) set in the index
itself at /oak:index/<...>, in a hidden property ":writeNode"

2) Reading (queries) use the old subtree if there is any (":index" right
now). Which one it used is (automatically) set in the index itself at
/oak:index/<...>, in a hidden property ":liveNode"

3) Synchronous index updates are written to the index node defined at
":liveNode". Therefore, at the same time, reindex to a new subtree and
index updates to the old subtree can occur (actually more than that, see
below at 6).

4) To track reindexing/indexing progress, there is a "current position"
persisted at /oak:index/<...>, in a hidden property ":writePath". This
entry is automatically advanced by the asynchronous reindexing thread from
"/" to "/a" to "/b/a" to "/content/a/b", "/content/x", "/system/03/01",
"/system/04", "/system/0a",... and so on, until the whole repository is
indexed.

5) The asynchronous indexing thread reads all nodes of the repository
_after_ what is written at ":writePath" and indexes that to the new
subtree, and once there are 1000 changes to the index, the last indexed
path is written to ":writePath", plus the additions to the index, in one
transaction.

6) Synchronous index updates are also written to the index node defined at
":writeNode", if they affect nodes before the ":writePath".

7) After a restart, reindexing continues where it left off, using the last
":writePath".

8) After reindexing is done, ":livePath" is updated to ":index_2", and
":writePath" is removed. Also, the old index subtree is removed (1000
nodes per commit).

9) Sorting of path is needed, so that the repository can be processed bit
by bit by bit. For that, the following logic is used, recursively: read at
most 1000 child nodes. If there are more than 1000, then this subtree is
never split but processed in one step (so many child nodes can still lead
to large transactions, unfortunately). If less than 1000 child nodes, then
the names of all child nodes are read, and processed in sorted order
(sorted by node name).

Therefore, all reindexing operations use small transactions. Queries can
run concurrently. Reindexing can be paused. Reindexing can continue even
after a restart. While reindexing, the old index is maintained and
up-to-date. The branch-less commit mode is not needed. No conflicts
between the synchronous index updates and asynchronous reindexing thread
are possible.

Regards,
Thomas

Reply via email to