Got it, thanks for providing additional context on the use case!

On Sat, Apr 19, 2025 at 9:40 PM Rahul Goswami <rahul196...@gmail.com> wrote:
> There is an effort underway in Apache Solr where we want to provide a path
> to a legitimate upgrade without needing to reindex from source:
> https://issues.apache.org/jira/browse/SOLR-17725
>
> Essentially the proposal is to read documents from segments where
> minVersion < current version and reindex them. At the same time, while
> the process is underway, have a custom merge policy which would exclude
> such segments from merging with latest version segments to prevent
> pollution.
>
> Result is an index which only contains segments with minVersion and
> version stamps the same as the current Lucene version (essentially case #2
> that we discussed). This index would in all respects be an "upgraded"
> index, but would need "indexCreatedVersionMajor" to be reset as well. This
> is where the Lucene API (to reset "indexCreatedVersionMajor") becomes
> essential.
>
> I believe this is a pattern which can also be adopted by other Lucene
> based search engines like Opensearch and Elasticsearch, and hence having
> this API could potentially benefit a large Lucene base.
>
> -Rahul
>
> On Sat, Apr 19, 2025 at 11:49 PM Ankit Jain <jain.ank...@gmail.com> wrote:
>
>> > Consider the following sequence of events...
>> an index with 2 segments (seg1 and seg2) originally created in Lucene
>> 8.x. ==> Upgrade to 9.x ==> index few documents and commit ==> seg3 gets
>> created with version 9.x, but merge doesn't kick in ==> documents in seg1
>> and seg2 get deleted followed by commit. ==> You are left with seg3 in 9.x
>> but indexCreatedVersionMajor as 8.x ==> Upgrade to Lucene 10.x fails.
>>
>> Thanks for the explanation. I am wondering if this is something that you
>> commonly encounter, seems like a bit of an edge case?
>>
>> Regarding scenario 1, deleting the entire index and recreating it is
>> generally faster and less resource intensive instead of deleting all the
>> documents. Most systems built on top of Lucene like Solr, OpenSearch,
>> Elasticsearch expose delete API for collection/index, and users just delete
>> and recreate the index. Probably, one of the reasons it hasn't come up much
>> before. Will let other community members chime in on this.
>>
>> On Sat, Apr 19, 2025 at 7:43 PM Rahul Goswami <rahul196...@gmail.com>
>> wrote:
>>
>>> For complete clarity..."minVersion" for a SegmentInfo is the min of the
>>> minVersions of all segments involved in the merge which resulted in this
>>> segment. If it is a "pure" segment, then minVersion=version.
>>>
>>> On Sat, Apr 19, 2025 at 10:35 PM Rahul Goswami <rahul196...@gmail.com>
>>> wrote:
>>>
>>>> Ankit,
>>>> "I guess the SegmentInfo "minVersion" is the min across all segments
>>>> during the merge process?"
>>>> > That is correct
>>>>
>>>> I am wondering if there is any way to end up in the 2nd scenario,
>>>> without having deleted all the documents first?
>>>> > Consider the following sequence of events...
>>>> an index with 2 segments (seg1 and seg2) originally created in Lucene
>>>> 8.x. ==> Upgrade to 9.x ==> index few documents and commit ==> seg3 gets
>>>> created with version 9.x, but merge doesn't kick in ==> documents in seg1
>>>> and seg2 get deleted followed by commit. ==> You are left with seg3 in 9.x
>>>> but indexCreatedVersionMajor as 8.x ==> Upgrade to Lucene 10.x fails.
>>>>
>>>> -Rahul
>>>>
>>>> On Sat, Apr 19, 2025 at 1:01 PM Ankit Jain <jain.ank...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Rahul,
>>>>>
>>>>> Thanks for starting this interesting discussion. I was initially
>>>>> thinking that this API potentially allows upgrading
>>>>> "indexCreatedVersionMajor" via the merge process after rewriting all the
>>>>> segments, but I guess the SegmentInfo "minVersion" is the min across all
>>>>> segments during the merge process?
>>>>>
>>>>> So, I am wondering if there is any way to end up in the 2nd scenario,
>>>>> without having deleted all the documents first?
>>>>>
>>>>> Thanks
>>>>> Ankit
>>>>>
>>>>> On Sat, Apr 19, 2025 at 9:17 AM Rahul Goswami <rahul196...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>> Today even after all documents in an index are deleted via an API
>>>>>> call, reindexing still doesn't change the "indexCreatedVersionMajor"
>>>>>> property value in SegmentInfos. Hence even after complete reindexing,
>>>>>> an upgrade path X--> X+1 --> X+2 is still not possible as we end up
>>>>>> with an IndexFormatTooOldException.
>>>>>>
>>>>>> Requesting an API (on IndexWriter?) which can reset this property
>>>>>> (upon a new commit) to the current Lucene version if:
>>>>>> 1) No more live docs present
>>>>>> OR
>>>>>> 2) If all SegmentInfo in the index have a "minVersion" AND "version"
>>>>>> stamp of the latest version, but SegmentInfos has an older
>>>>>> "indexCreatedVersionMajor".
>>>>>>
>>>>>> This will help users a LOT since they can now interact with the index
>>>>>> purely via API without needing manual deletion and also help open up a
>>>>>> legitimate path to upgrade when an index doesn't HAVE to be repopulated
>>>>>> from the source.
>>>>>>
>>>>>> If there is agreement, I am happy to pick this up and submit a PR.
>>>>>>
>>>>>> Thanks,
>>>>>> Rahul Goswami
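
The two reset conditions proposed in the original message, together with the minVersion rule for merged segments, can be sketched roughly as follows. This is only an illustration under stated assumptions: `Segment`, `Index`, `merge`, and `canReset` are hypothetical stand-ins invented here, not Lucene classes and not the proposed API, and versions are reduced to their major numbers.

```java
import java.util.List;

// Hypothetical, simplified model for illustration only; none of these
// types or methods are part of Lucene.
final class ResetCheckSketch {

    /** Stand-in for per-segment metadata (minVersion, version, live doc count). */
    record Segment(String name, int minVersionMajor, int versionMajor, int liveDocs) {}

    /** Stand-in for commit-level metadata (SegmentInfos in the thread). */
    record Index(int indexCreatedVersionMajor, List<Segment> segments) {}

    /**
     * A merged segment is written by the current version, but its minVersion
     * is the minimum of the minVersions of all segments that went into it.
     */
    static Segment merge(String name, int currentMajor, List<Segment> inputs) {
        int min = currentMajor;
        int live = 0;
        for (Segment s : inputs) {
            min = Math.min(min, s.minVersionMajor());
            live += s.liveDocs();
        }
        return new Segment(name, min, currentMajor, live);
    }

    /** Condition 1 from the thread: no live documents remain. */
    static boolean noLiveDocs(Index index) {
        return index.segments().stream().allMatch(s -> s.liveDocs() == 0);
    }

    /** Condition 2 from the thread: every segment is "pure" current-version. */
    static boolean allSegmentsCurrent(Index index, int currentMajor) {
        return index.segments().stream().allMatch(
            s -> s.minVersionMajor() == currentMajor && s.versionMajor() == currentMajor);
    }

    /** True when the index was created by an older major and either condition holds. */
    static boolean canReset(Index index, int currentMajor) {
        return index.indexCreatedVersionMajor() < currentMajor
            && (noLiveDocs(index) || allSegmentsCurrent(index, currentMajor));
    }
}
```

For the seg1/seg2/seg3 sequence described in the thread, an index created in 8.x that holds only a pure 9.x segment would pass `canReset(index, 9)`, while one that still holds a live segment with minVersion 8 (for example, the result of merging seg1 with seg3) would not.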