Hi,

> Then would use that UUID as the prefix ...


Sorry, that should be "Then would use that _time_ as the prefix ..." - I
thought about using a UUID first, but then changed to milliseconds since
1970, as that's easier (you immediately see which one is the latest
directory). But UUID would work as well.
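
To make the idea concrete, here is a rough sketch of the scheme in Python. It is not Oak code: the dict `nodes` only stands in for the children of the index definition node, the `segments` file is a placeholder for real index files, and the function names `reindex` and `latest_index_dir` are made up for this example.

```python
import shutil
import tempfile
import time
from pathlib import Path


def latest_index_dir(base: Path):
    """Return the subdirectory with the largest numeric (timestamp) name, or None."""
    candidates = [p for p in base.iterdir() if p.is_dir() and p.name.isdigit()]
    return max(candidates, key=lambda p: int(p.name), default=None)


def reindex(nodes: dict, base: Path, now_ms=None) -> Path:
    """Build a new timestamped index directory, swap it in, then drop older ones."""
    ts = int(time.time() * 1000) if now_ms is None else now_ms
    nodes[":dataInProgress"] = {"time": ts}

    new_dir = base / str(ts)
    # mkdir raises FileExistsError if the directory already exists,
    # i.e. if a reindex was already started in the same millisecond.
    new_dir.mkdir(parents=True)
    (new_dir / "segments").write_text("index built at %d" % ts)  # placeholder

    # The swap. In the repository this would be one transaction that removes
    # the old ":data" node and renames ":dataInProgress" to ":data".
    nodes[":data"] = nodes.pop(":dataInProgress")

    # Only after the swap, remove the directories that are no longer referenced.
    for p in base.iterdir():
        if p.is_dir() and p != new_dir:
            shutil.rmtree(p)
    return new_dir


# Usage: index once, then reindex, using the timestamps from the example.
base = Path(tempfile.mkdtemp())
nodes = {}
reindex(nodes, base, now_ms=1413189793297)
reindex(nodes, base, now_ms=1413189822022)
print(nodes)
print(latest_index_dir(base).name)
```

Because the directory names are plain numbers, finding the latest directory is just a numeric maximum, which would not be possible with a UUID prefix.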

Regards,
Thomas



On 13/10/14 10:45, "Thomas Mueller" <muel...@adobe.com> wrote:

>Hi,
>
>As for external Lucene indexes, what about this:
>
>* in the ":data" node, store an index creation time, in milliseconds since
>1970
>* use that as a path prefix for the actual index files
>
>So if the index is configured as follows:
>
>  /oak:index/lucene { path: "/quickstart/repo/lucenIndex" }
>
>Then internally, Oak Lucene would create a node
>
>  /oak:index/lucene/:dataInProgress { time: 1413189793297 }
>
>Then would use that UUID as the prefix for the directory, and the index is
>created in:
>
>  /quickstart/repo/lucenIndex/1413189793297
>
>When the index is built, the node ":dataInProgress" is renamed to ":data":
>
>  /oak:index/lucene/:data { time: 1413189793297 }
>
>To read, this directory would be used. When reindexing, two nodes and
>two directories would temporarily exist:
>
>  /oak:index/lucene/:data { time: 1413189793297 }
>  /oak:index/lucene/:dataInProgress { time: 1413189822022 }
>
>  /quickstart/repo/lucenIndex/1413189793297
>
>  /quickstart/repo/lucenIndex/1413189822022
>
>Once the index is done, in one transaction, the old ":data" node is
>removed and the node ":dataInProgress" is renamed to ":data". Then the old
>directories are removed.
>
>You can only reindex once per millisecond, but I guess this isn't a
>problem.
>
>Regards,
>Thomas
>
>
>
>
>
>
>On 13/10/14 10:29, "Alex Parvulescu" <alex.parvule...@gmail.com> wrote:
>
>>Hi,
>>
>>
>>> If we set reindex to true in any index definition then Oak would
>>> remove the existing index content before performing the reindex. This
>>> would work fine if the index content is stored within the NodeStore
>>> itself.
>>
>>It is important to also specify that this appears as a single commit thanks
>>to the MVCC model (delete + set reindexed index), so there's no downtime to
>>speak of: the original index remains available during the reindex process.
>>
>>
>>> However if the index is stored externally, e.g. a Solr or Lucene index
>>> with persistence set to filesystem, then I think currently we do not
>>> remove the existing index data, which might lead to the index
>>> containing stale data.
>>
>>Agreed, this is a problem when storing the index outside the repo. The
>>interesting part here is that only content updates might be affected.
>>A deleted node will not resurface, because the query engine reloads nodes
>>to check that they are readable by the current session (ACL checks) and so
>>skips over the nodes it can't read, if I remember correctly.
>>
>>Focusing on the Lucene index now, I went through the code a bit (no proper
>>tests yet) and it looks like it might not be affected by this that much. A
>>reindex call has the before state empty, so Lucene will update all the
>>documents it finds [0], so no stale content on updates here. Only the
>>events for deleted nodes are missing.
>>So the remaining question is about identifying content that was deleted
>>between the indexed state and the current head state. One simple solution
>>is to run a 'remove all documents' query on the Lucene index, but that has
>>the downside of making the index unusable while the indexing process runs,
>>so I don't see it as a really good option, only maybe as a fallback of
>>sorts.
>>
>>
>>> Should we provide any sort of callback for indexers when reindex is
>>> requested?
>>
>>Thinking about this a bit, there's a simpler way of handling a reindex
>>call. If you really need to know whether the current indexing run is
>>actually a reindex, you can check if the before state is the empty one on
>>the root index editor.
>>
>>best,
>>alex
>>
>>[0]
>>https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneIndexEditor.java#L109
>>
>>
>>
>>On Mon, Oct 13, 2014 at 7:33 AM, Chetan Mehrotra <chetan.mehro...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> If we set reindex to true in any index definition then Oak would
>>> remove the existing index content before performing the reindex. This
>>> would work fine if the index content is stored within the NodeStore
>>> itself.
>>>
>>> However if the index is stored externally, e.g. a Solr or Lucene index
>>> with persistence set to filesystem, then I think currently we do not
>>> remove the existing index data, which might lead to the index
>>> containing stale data.
>>>
>>> Should we provide any sort of callback for indexers when reindex is
>>> requested?
>>>
>>> Chetan Mehrotra
>>>
>
