(with apologies for forgetting to cc Rene on my previous message)

On Wed, 3 Jan 2024 at 19:45, James Addison <j...@jp-hosting.net> wrote:
>
> On Wed, 3 Jan 2024 at 16:59, Thorsten Behrens <t...@libreoffice.org> wrote:
> >
> > Source: clucene-core
> > Followup-For: Bug #1059805
> > X-Debbugs-Cc: r...@debian.org, t...@libreoffice.org
> >
> > James Addison wrote:
> > > And so a question: could the fix be achieved by changing the default
> > > value of the version field from millis-timestamp to zero -- meaning:
> > > without needing to adjust the API?
> > >
> > Note that we _only_ need this during build time (so another option
> > would be using the custom clucene during build, but still link the
> > system clucene for runtime).
> >
> > I simply don't know enough about the motivations to have this
> > pseudo-random value there in the first place, to opine on whether the
> > default can/should be changed...
>
> Agreed. I'm trying to trace the origins of that behaviour.

[ ... snip ... ]

> This timestamping was introduced[1] in Lucene 1.9RC1 back in Y2005 -
> the changelog entry[2] doesn't tell us much more, although we can
> imply that it was for a timing/refreshness related fix, because the
> change adds an isCurrent method (described as useful to check whether
> an IndexReader has an up-to-date object reference to an index/segment)
> and test coverage. There is a subsequent race-condition fixup[3].

[ ... snip ... ]
Squinting more at the test coverage, one hypothetical explanation I can
think of is this:

If two index writer processes were likely to recreate an index file at
nearly the same time, or in the presence of long-lived index reader
processes, then with a zero-based counter, a reader could easily mistake
two independently-created index segments (each created with segment
version zero) as being 'current'.

I don't know for certain that this is the scenario that the timestamp is
intended to mitigate -- it's certainly also possible that Solr
replication (noted elsewhere in the LUCENE-3607 JIRA thread) was just as
important, or maybe more so.

However: libreoffice is, as I understand it, using this during one-time
build processes, and does not require replication of those indexes.

I think the safe middle-ground, rather than resetting to zero during a
reproducible build, would be to use the value of SOURCE_DATE_EPOCH[1].
That should mean no API changes, and we're still using a timestamp,
keeping us close to the current behaviour -- but in a more reproducible
way.

There's also a useful note in LUCENE-3607 that the selected segment merge
algorithm in Lucene is relevant when attempting to build indexes
reproducibly. Fortunately, from a quick inspection, I think that
clucene-core already uses SerialMergeScheduler (non-concurrent,
recommended for reproducibility) by default[2].

[1] - https://reproducible-builds.org/docs/source-date-epoch/

[2] - https://sources.debian.org/src/clucene-core/2.3.3.4+dfsg-1.1/src/core/CLucene/index/IndexWriter.cpp/?hl=185#L185
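
To make the SOURCE_DATE_EPOCH suggestion a little more concrete, here is a rough, untested-against-clucene sketch of the idea. The helper name initialIndexVersion() is hypothetical and is not an existing clucene-core function; I'm also assuming the version field stays a millisecond count, so the epoch seconds are scaled by 1000. Where exactly this would be wired in (wherever the millis-timestamp default is assigned today) is left open.

```cpp
#include <cerrno>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <limits>

// Hypothetical helper, NOT part of clucene-core: pick the initial value
// for the index "version" field.
//
// If SOURCE_DATE_EPOCH is set to a valid non-negative integer (seconds
// since the Unix epoch), use it, scaled to milliseconds so that it stays
// comparable with the existing millis-timestamp default. Otherwise fall
// back to the current wall-clock time in milliseconds.
int64_t initialIndexVersion() {
    const char* sde = std::getenv("SOURCE_DATE_EPOCH");
    if (sde != nullptr && *sde != '\0') {
        errno = 0;
        char* end = nullptr;
        const long long secs = std::strtoll(sde, &end, 10);
        const long long maxSecs = std::numeric_limits<int64_t>::max() / 1000;
        // Accept only a fully-parsed, in-range, non-negative value.
        if (errno == 0 && *end == '\0' && secs >= 0 && secs <= maxSecs) {
            return static_cast<int64_t>(secs) * 1000;
        }
    }
    // Unset or malformed: keep the non-reproducible timestamp behaviour.
    using namespace std::chrono;
    return duration_cast<milliseconds>(
        system_clock::now().time_since_epoch()).count();
}
```

With SOURCE_DATE_EPOCH exported by the build (as dpkg-buildpackage does), two builds of the same source would then start from the same version value, while ordinary runtime use without the variable would be unchanged.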