(with apologies for forgetting to cc Rene on my previous message)

On Wed, 3 Jan 2024 at 19:45, James Addison <j...@jp-hosting.net> wrote:
>
> On Wed, 3 Jan 2024 at 16:59, Thorsten Behrens <t...@libreoffice.org> wrote:
> >
> > Source: clucene-core
> > Followup-For: Bug #1059805
> > X-Debbugs-Cc: r...@debian.org, t...@libreoffice.org
> >
> > James Addison wrote:
> > > And so a question: could the fix be achieved by changing the default
> > > value of the version field from millis-timestamp to zero -- meaning:
> > > without needing to adjust the API?
> > >
> > Note that we _only_ need this during build time (so another option
> > would be using the custom clucene during build, but still link the
> > system clucene for runtime).
> >
> > I simply don't know enough about the motivations to have this
> > pseudo-random value there in the first place, to opine on whether the
> > default can/should be changed...
>
> Agreed.  I'm trying to trace the origins of that behaviour.
[ ... snip ... ]
> This timestamping was introduced[1] in Lucene 1.9RC1 back in 2005 -
> the changelog entry[2] doesn't tell us much more, although we can
> infer that it was for a timing/freshness-related fix, because the
> change adds an isCurrent method (described as useful to check whether
> an IndexReader has an up-to-date object reference to an index/segment)
> and test coverage.  There is a subsequent race-condition fixup[3].
[ ... snip ... ]

Squinting more at the test coverage, a hypothetical explanation I can
think of is this:

If two index writer processes were likely to recreate an index file
at around the same time, or if long-lived index reader processes were
present, then with a zero-based counter a reader could easily mistake
one of two independently-created index segments (both created with
segment version zero) for the other, and so wrongly consider itself
'current'.

I don't know for certain that this is the scenario that the timestamp
is intended to guard against -- it's certainly also possible that
Solr replication (noted elsewhere in the LUCENE-3607 JIRA thread) was
just as important, or maybe more so.

However: libreoffice is, as I understand it, using this during
one-time build processes, and does not require replication of those
indexes.

I think the safe middle ground, rather than resetting to zero during a
reproducible build, would be to use the value of SOURCE_DATE_EPOCH[1].
That should mean no API changes, and we're still using a timestamp,
keeping us close to the current behaviour -- but in a more
reproducible way.
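
As a rough sketch of what I mean (not a patch against clucene-core):
the function name initialIndexVersion is hypothetical, and I haven't
checked where exactly it would plug in.  SOURCE_DATE_EPOCH is given in
seconds, so I'm assuming we'd scale it to milliseconds to stay in the
same range as the existing millis-timestamp default, and fall back to
the wall clock when the variable is unset or malformed:

```cpp
#include <cerrno>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <limits>

// Hypothetical helper, not part of clucene-core: choose the initial
// value for the index version field.
std::int64_t initialIndexVersion() {
    const char* sde = std::getenv("SOURCE_DATE_EPOCH");
    if (sde != nullptr && *sde != '\0') {
        errno = 0;
        char* end = nullptr;
        const long long secs = std::strtoll(sde, &end, 10);
        const long long maxSecs =
            std::numeric_limits<std::int64_t>::max() / 1000;
        // Accept only a fully-parsed, non-negative value that will not
        // overflow when converted from seconds to milliseconds.
        if (errno == 0 && *end == '\0' && secs >= 0 && secs <= maxSecs) {
            return static_cast<std::int64_t>(secs) * 1000;
        }
    }
    // Fallback: current wall-clock time in milliseconds, i.e. the
    // non-reproducible behaviour described in this thread.
    using namespace std::chrono;
    return duration_cast<milliseconds>(
               system_clock::now().time_since_epoch())
        .count();
}
```

Builds that don't set SOURCE_DATE_EPOCH would see no change, while a
reproducible build would get the same starting value on every run.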

There's also a useful note in LUCENE-3607 that the selected segment
merge algorithm in Lucene is relevant when attempting to build indexes
reproducibly.  Fortunately, from a quick inspection, I think that
clucene-core uses SerialMergeScheduler (non-concurrent, recommended
for reproducibility) by default[2] already.

[1] - https://reproducible-builds.org/docs/source-date-epoch/

[2] - https://sources.debian.org/src/clucene-core/2.3.3.4+dfsg-1.1/src/core/CLucene/index/IndexWriter.cpp/?hl=185#L185
