Re: Reindex and external indexes - Possibility of stale index data

Thomas Mueller Tue, 21 Oct 2014 00:50:01 -0700

Hi,

>It might be simpler if we just record the index creation time in the
>index definition node itself (or some predefined meta node under
>definition node). This can be done in IndexUpdate itself where it
>would set the time when it triggers a reindex or the first index.


Sorry, I don't understand. I think the index creation time itself doesn't
need to be recorded anywhere.

What we need is a distinction between the old and the new index *data*, so
that while reindexing is in progress, the old, existing index can still be
used. So we need two distinct directories for the index data (one
directory for the old data, one directory for the new data). The names of
the directories can be arbitrary: they could be UUIDs, or a counter. My
idea was to use the time (milliseconds since 1970) when reindexing /
indexing started as the directory name. That way you know which *data* is
new and which data is old (by comparing the values).
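
As a rough illustration of "comparing the values", here is a minimal Java
sketch that picks the newest timestamp-named directory under a base path.
The class and method names (TimestampedIndexDirs, timeOf, list, newest) are
made up for this example and are not Oak APIs; the sketch assumes directory
names are plain decimal milliseconds since 1970.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper for illustration only; not part of Oak. */
final class TimestampedIndexDirs {

    /** Parses the directory name as milliseconds since 1970, or returns -1 if it is not a number. */
    static long timeOf(Path dir) {
        try {
            return Long.parseLong(dir.getFileName().toString());
        } catch (NumberFormatException e) {
            return -1L;
        }
    }

    /** Lists the timestamp-named sub-directories of base, oldest first. */
    static List<Path> list(Path base) {
        try (Stream<Path> children = Files.list(base)) {
            return children
                    .filter(Files::isDirectory)
                    .filter(p -> timeOf(p) >= 0)
                    .sorted(Comparator.comparingLong(TimestampedIndexDirs::timeOf))
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the directory with the highest timestamp, if any exists. */
    static Optional<Path> newest(Path base) {
        List<Path> dirs = list(base);
        return dirs.isEmpty() ? Optional.empty() : Optional.of(dirs.get(dirs.size() - 1));
    }
}
```

Everything in list(base) except the last entry is older data. Whether the
newest directory may already be used for reading is a separate question
that the directory names alone do not answer.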

Regards,
Thomas



>
>Later Lucene would make use of that time to create a directory as you
>suggested above and reclaim old directory
>Chetan Mehrotra
>
>
>On Mon, Oct 13, 2014 at 2:33 PM, Thomas Mueller <muel...@adobe.com> wrote:
>> Hi,
>>
>>> Then would use that UUID as the prefix ...
>>
>>
>> Sorry, that should be "Then would use that _time_ as the prefix ..." - I
>> thought about using a UUID first, but then changed to milliseconds since
>> 1970, as that's easier (you immediately see which one is the latest
>> directory). But UUID would work as well.
>>
>> Regards,
>> Thomas
>>
>>
>>
>> On 13/10/14 10:45, "Thomas Mueller" <muel...@adobe.com> wrote:
>>
>>>Hi,
>>>
>>>As for external Lucene indexes, what about this:
>>>
>>>* in the ":data" node, store an index creation time, in milliseconds
>>>  since 1970
>>>* use that as a path prefix for the actual index files
>>>
>>>So if the index is configured as follows:
>>>
>>>  /oak:index/lucene { path: "/quickstart/repo/lucenIndex" }
>>>
>>>Then internally, Oak Lucene would create a node
>>>
>>>  /oak:index/lucene/:dataInProgress { time: 1413189793297 }
>>>
>>>Then would use that UUID as the prefix for the directory, and the index
>>>is created in:
>>>
>>>  /quickstart/repo/lucenIndex/1413189793297
>>>
>>>When the index is built, the node ":dataInProgress" is renamed to
>>>":data":
>>>
>>>  /oak:index/lucene/:data { time: 1413189793297 }
>>>
>>>To read, this directory would be used. When reindexing, two nodes and
>>>directories would temporarily exist:
>>>
>>>  /oak:index/lucene/:data { time: 1413189793297 }
>>>  /oak:index/lucene/:dataInProgress { time: 1413189822022 }
>>>
>>>  /quickstart/repo/lucenIndex/1413189793297
>>>
>>>  /quickstart/repo/lucenIndex/1413189822022
>>>
>>>Once the index is done, in one transaction, the old ":data" node is
>>>removed and the node ":dataInProgress" is renamed to ":data". Then the
>>>old directories are removed.
>>>
>>>You can only reindex once per millisecond, but I guess this isn't a
>>>problem.
>>>
>>>Regards,
>>>Thomas
>>>
>>>
>>>
>>>
>>>
>>>
>>>On 13/10/14 10:29, "Alex Parvulescu" <alex.parvule...@gmail.com> wrote:
>>>
>>>>Hi,
>>>>
>>>>
>>>>> If we set reindex to true in any index definition then Oak would
>>>>> remove the existing index content before performing the reindex. This
>>>>> would work fine if the index content is stored within NodeStore
>>>>> itself.
>>>>
>>>>It is important to also specify that this appears as a single commit
>>>>thanks to the MVCC model (delete + set reindexed index), so there's no
>>>>downtime to speak of: the original index is available during the
>>>>reindex process.
>>>>
>>>>
>>>>> However if the index is stored externally, e.g. Solr or Lucene index
>>>>> with persistence set to filesystem, then I think currently we do not
>>>>> remove the existing index data, which might lead to the index
>>>>> containing stale data.
>>>>
>>>>Agreed, this is a problem when storing the index outside the repo. The
>>>>interesting part here is that only content updates might be affected.
>>>>A deleted node will not resurface in query results, because the query
>>>>engine reloads nodes to see if they are readable by the current session
>>>>(ACL checks), so it skips over the nodes it can't read, if I remember
>>>>correctly.
>>>>
>>>>Focusing on the Lucene index now, I went through the code a bit (no
>>>>proper tests yet) and it looks like it might not be affected by this
>>>>that much. A reindex call has the before state empty so Lucene will
>>>>update all the documents it finds [0], so no stale content on updates
>>>>here. Just missing deleted node events.
>>>>So the remaining question is about identifying content that was deleted
>>>>between the indexed state and the current head state. One simple
>>>>solution is to run a 'remove all documents query' on the Lucene index,
>>>>but that has the downside of making the index unusable during the time
>>>>the indexing process runs, so I don't see it as a really good option,
>>>>only maybe as a fallback of sorts.
>>>>
>>>>
>>>>> Should we provide any sort of callback for indexers when reindex is
>>>>> requested?
>>>>
>>>>Thinking about this a bit, there's a simpler way of handling a reindex
>>>>call. If you really need to know that the current index update is
>>>>actually a reindex call, you can check if the before state is the empty
>>>>one on the root index editor.
>>>>
>>>>best,
>>>>alex
>>>>
>>>>[0]
>>>>https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneIndexEditor.java#L109
>>>>
>>>>
>>>>
>>>>On Mon, Oct 13, 2014 at 7:33 AM, Chetan Mehrotra
>>>><chetan.mehro...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> If we set reindex to true in any index definition then Oak would
>>>>> remove the existing index content before performing the reindex. This
>>>>> would work fine if the index content is stored within NodeStore
>>>>> itself.
>>>>>
>>>>> However if the index is stored externally, e.g. Solr or Lucene index
>>>>> with persistence set to filesystem, then I think currently we do not
>>>>> remove the existing index data, which might lead to the index
>>>>> containing stale data.
>>>>>
>>>>> Should we provide any sort of callback for indexers when reindex is
>>>>> requested?
>>>>>
>>>>> Chetan Mehrotra
>>>>>
>>>
>>
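
The ":dataInProgress" / ":data" sequence proposed earlier in the thread can
be modelled roughly as follows. This is only an in-memory sketch of the
state transitions, under the assumption that each node carries a single
"time" value; the class IndexDataSwap and its methods are hypothetical and
do not use or represent Oak's NodeStore API.

```java
import java.util.OptionalLong;

/** Hypothetical in-memory model of the ":data" and ":dataInProgress" nodes; not Oak code. */
final class IndexDataSwap {
    private Long data;           // "time" on ":data", null if no index has been built yet
    private Long dataInProgress; // "time" on ":dataInProgress", null if no reindex is running

    /** Starts (re)indexing and returns the directory name the new index is built in. */
    synchronized String begin(long nowMillis) {
        if (dataInProgress != null) {
            throw new IllegalStateException("reindex already in progress");
        }
        if (data != null && data == nowMillis) {
            // Directory names are timestamps, so only one reindex per millisecond is possible.
            throw new IllegalStateException("directory " + nowMillis + " already exists");
        }
        dataInProgress = nowMillis;
        return Long.toString(nowMillis);
    }

    /**
     * Finishes (re)indexing: in one step, ":dataInProgress" replaces ":data".
     * Returns the timestamp of the old directory, which can now be removed, if there was one.
     */
    synchronized OptionalLong finish() {
        if (dataInProgress == null) {
            throw new IllegalStateException("no reindex in progress");
        }
        OptionalLong old = data == null ? OptionalLong.empty() : OptionalLong.of(data);
        data = dataInProgress;
        dataInProgress = null;
        return old;
    }

    /** Timestamp of the directory readers should use, if an index has been built. */
    synchronized OptionalLong readable() {
        return data == null ? OptionalLong.empty() : OptionalLong.of(data);
    }
}
```

While a reindex is running, readable() keeps returning the old timestamp,
which is the point of keeping two directories. The value returned by
finish() is the directory that is safe to delete afterwards.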
