Got it, thanks for providing additional context on the use case!

On Sat, Apr 19, 2025 at 9:40 PM Rahul Goswami <rahul196...@gmail.com> wrote:

> There is an effort underway in Apache Solr where we want to provide a path
> to a legitimate upgrade without needing to reindex from source:
> https://issues.apache.org/jira/browse/SOLR-17725
>
> Essentially, the proposal is to read documents from segments where
> minVersion < current version and reindex them. While that process is
> underway, a custom merge policy would exclude such segments from merging
> with latest-version segments, to prevent pollution.
>
> The result is an index that contains only segments whose minVersion and
> version stamps both match the current Lucene version (essentially case #2
> that we discussed). This index would in all respects be an "upgraded"
> index, but would need "indexCreatedVersionMajor" to be reset as well. This
> is where the Lucene API (to reset "indexCreatedVersionMajor") becomes
> essential.
>
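
The two checks in the proposal quoted above can be sketched in plain Java:
which segments would need reindexing (minVersion < current version), and
whether every remaining segment carries current minVersion and version
stamps. This is only an illustrative model under stated assumptions:
CreatedVersionResetSketch, SegmentStamp, segmentsToReindex and
allSegmentsCurrent are hypothetical names, not Lucene APIs, and versions are
reduced to their major number.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical model, not a Lucene API: plain Java stand-ins for the
// per-segment "minVersion" and "version" stamps discussed in the thread.
final class CreatedVersionResetSketch {

    /** Stand-in for the version stamps on one segment (major versions only). */
    static final class SegmentStamp {
        final String name;
        final int minVersionMajor; // oldest major version that contributed to the segment
        final int versionMajor;    // major version that wrote the segment

        SegmentStamp(String name, int minVersionMajor, int versionMajor) {
            this.name = name;
            this.minVersionMajor = minVersionMajor;
            this.versionMajor = versionMajor;
        }
    }

    /** Names of segments whose minVersion is older than the current major version. */
    static List<String> segmentsToReindex(List<SegmentStamp> segments, int currentMajor) {
        return segments.stream()
                .filter(s -> s.minVersionMajor < currentMajor)
                .map(s -> s.name)
                .collect(Collectors.toList());
    }

    /**
     * True when every segment has minVersion == version == current major version.
     * Note that an empty segment list also returns true.
     */
    static boolean allSegmentsCurrent(List<SegmentStamp> segments, int currentMajor) {
        return segments.stream()
                .allMatch(s -> s.minVersionMajor == currentMajor
                        && s.versionMajor == currentMajor);
    }

    public static void main(String[] args) {
        // Mirrors the example in the thread: seg1 and seg2 written by 8.x, seg3 by 9.x.
        List<SegmentStamp> beforeReindex = Arrays.asList(
                new SegmentStamp("seg1", 8, 8),
                new SegmentStamp("seg2", 8, 8),
                new SegmentStamp("seg3", 9, 9));
        System.out.println("to reindex: " + segmentsToReindex(beforeReindex, 9));
        System.out.println("all current: " + allSegmentsCurrent(beforeReindex, 9));

        // Once only 9.x segments remain, the index would qualify under case #2.
        List<SegmentStamp> afterReindex = Arrays.asList(new SegmentStamp("seg3", 9, 9));
        System.out.println("all current: " + allSegmentsCurrent(afterReindex, 9));
    }
}
```

In this model, the reset of "indexCreatedVersionMajor" would only be
considered once allSegmentsCurrent returns true. How the reset itself would
be exposed in Lucene is exactly what the thread is asking for, so it is left
out of the sketch.
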
> I believe this is a pattern which can also be adopted by other
> Lucene-based search engines like OpenSearch and Elasticsearch, and hence
> having this API could potentially benefit a large Lucene user base.
>
> -Rahul
>
> On Sat, Apr 19, 2025 at 11:49 PM Ankit Jain <jain.ank...@gmail.com> wrote:
>
>> > Consider the following sequence of events...
>> an index with 2 segments (seg1 and seg2) originally created in Lucene
>> 8.x. ==> Upgrade to 9.x ==> index a few documents and commit ==> seg3 gets
>> created with version 9.x, but merge doesn't kick in ==> documents in seg1
>> and seg2 get deleted, followed by a commit ==> You are left with seg3 in 9.x
>> but indexCreatedVersionMajor as 8.x ==> Upgrade to Lucene 10.x fails.
>>
>> Thanks for the explanation. I am wondering whether this is something you
>> commonly encounter; it seems like a bit of an edge case.
>>
>> Regarding scenario 1, deleting the entire index and recreating it is
>> generally faster and less resource-intensive than deleting all the
>> documents. Most systems built on top of Lucene, like Solr, OpenSearch, and
>> Elasticsearch, expose a delete API for the collection/index, and users just
>> delete and recreate the index. That is probably one of the reasons this
>> hasn't come up much before. Will let other community members chime in on this.
>>
>> On Sat, Apr 19, 2025 at 7:43 PM Rahul Goswami <rahul196...@gmail.com>
>> wrote:
>>
>>> For complete clarity..."minVersion" for a SegmentInfo is the min of the
>>> minVersions of all segments involved in the merge which resulted in this
>>> segment. If it is a "pure" segment, then minVersion=version.
>>>
>>> On Sat, Apr 19, 2025 at 10:35 PM Rahul Goswami <rahul196...@gmail.com>
>>> wrote:
>>>
>>>> Ankit,
>>>> "I guess the SegmentInfo "minVersion" is the min across all segments
>>>> during the merge process?"
>>>> > That is correct
>>>>
>>>> I am wondering if there is any way to end up in the 2nd scenario,
>>>> without having deleted all the documents first?
>>>> > Consider the following sequence of events...
>>>> an index with 2 segments (seg1 and seg2) originally created in Lucene
>>>> 8.x. ==> Upgrade to 9.x ==> index a few documents and commit ==> seg3 gets
>>>> created with version 9.x, but merge doesn't kick in ==> documents in seg1
>>>> and seg2 get deleted, followed by a commit ==> You are left with seg3 in 9.x
>>>> but indexCreatedVersionMajor as 8.x ==> Upgrade to Lucene 10.x fails.
>>>>
>>>> -Rahul
>>>>
>>>> On Sat, Apr 19, 2025 at 1:01 PM Ankit Jain <jain.ank...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Rahul,
>>>>>
>>>>> Thanks for starting this interesting discussion. I was initially
>>>>> thinking that this API potentially allows upgrading
>>>>> "indexCreatedVersionMajor" via the merge process after rewriting all the
>>>>> segments, but I guess the SegmentInfo "minVersion" is the min across all
>>>>> segments during the merge process?
>>>>>
>>>>> So, I am wondering if there is any way to end up in the 2nd scenario,
>>>>> without having deleted all the documents first?
>>>>>
>>>>>
>>>>> Thanks
>>>>> Ankit
>>>>>
>>>>> On Sat, Apr 19, 2025 at 9:17 AM Rahul Goswami <rahul196...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>> Today even after all documents in an index are deleted via an API
>>>>>> call, reindexing still doesn't change the "indexCreatedVersionMajor"
>>>>>> property value in SegmentInfos. Hence even after complete reindexing,
>>>>>> an upgrade path X --> X+1 --> X+2 is still not possible, as we end up
>>>>>> with an IndexFormatTooOldException.
>>>>>>
>>>>>> Requesting an API (on IndexWriter?) which can reset this property
>>>>>> (upon a new commit) to the current Lucene version if:
>>>>>> 1) No more live docs present
>>>>>> OR
>>>>>> 2) All SegmentInfo in the index have a "minVersion" AND "version"
>>>>>> stamp of the latest version, but SegmentInfos has an older
>>>>>> "indexCreatedVersionMajor".
>>>>>>
>>>>>> This will help users a LOT since they can now interact with the index
>>>>>> purely via API without needing manual deletion and also help open up a
>>>>>> legitimate path to upgrade when an index doesn't HAVE to be repopulated
>>>>>> from the source.
>>>>>>
>>>>>> If there is agreement, I am happy to pick this up and submit a PR.
>>>>>>
>>>>>> Thanks,
>>>>>> Rahul Goswami
>>>>>>
>>>>>>
>>>>>>
