[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source

2026-01-31 Thread Rahul Goswami (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18055685#comment-18055685
 ] 

Rahul Goswami edited comment on SOLR-17725 at 1/31/26 9:00 PM:
---

Based on [~dsmiley]'s suggestion, and with 
[#3903|https://github.com/apache/solr/pull/3903] ready for merging, I am 
linking a separate JIRA (SOLR-18096) to the PR as well as to this JIRA (which 
provides the merge policy that powers the PR's implementation).


was (Author: [email protected]):
Based on [~dsmiley]'s suggestion, and with 
[#3903|https://github.com/apache/solr/pull/3903] (Expose a 
/admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade) 
ripe for merging, linking a separate JIRA 
([SOLR-18096|https://issues.apache.org/jira/browse/SOLR-18096]) to the PR and 
to this JIRA (which provides the merge policy that powers the PR implementation)

> Automatically upgrade Solr indexes without needing to reindex from source
> -
>
> Key: SOLR-17725
> URL: https://issues.apache.org/jira/browse/SOLR-17725
> Project: Solr
>  Issue Type: Improvement
>Reporter: Rahul Goswami
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0, 9.11
>
> Attachments: High Level Design.png
>
>  Time Spent: 11.5h
>  Remaining Estimate: 0h
>
> Today, upgrading from Solr version X to X+2 requires complete reingestion of 
> data from source. This stems from Lucene's constraint of only guaranteeing 
> index compatibility between the version the index was created in and the 
> immediately following major version.
> This reindexing usually comes with added downtime and/or cost. Especially for 
> deployments that run in customer environments and are not completely under 
> the vendor's control, having to completely reindex the data can become a 
> hard sell.
> I, on behalf of my employer, Commvault, have developed a way to perform this 
> reindexing in place on the same index. The process then automatically keeps 
> "upgrading" the indexes over multiple subsequent Solr upgrades without 
> needing manual intervention.
> It comes with the following limitations:
> i) All _source_ fields need to be either stored=true or docValues=true. Any 
> copyField destination fields can of course be stored=false; only the source 
> fields (or more precisely, the source fields you care about preserving) must 
> be either stored or have docValues enabled. 
> ii) The datatype of an existing field in schema.xml must not change upon 
> Solr upgrade. Introducing new fields is fine.
> For indexes where this limitation is not a problem (it wasn't for us!), the 
> tool can reindex in place on the same core with zero downtime and 
> legitimately "upgrade" the index. This removes a lot of operational 
> headaches, especially in environments with hundreds or thousands of very 
> large indexes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source

2025-12-30 Thread Rahul Goswami (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048422#comment-18048422
 ] 

Rahul Goswami edited comment on SOLR-17725 at 12/30/25 7:02 PM:


[~ichattopadhyaya] This is waiting for 
[#3903|https://github.com/apache/solr/pull/3903] to be merged.

This JIRA is split across two PRs:
[#3883|https://github.com/apache/solr/pull/3883] => A new 
LatestVersionMergePolicyFactory to block older segments from participating in 
merges. (This is merged.)
[#3903|https://github.com/apache/solr/pull/3903] => Expose a 
/admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade.

However, even with just #3883, users should be able to configure the merge 
policy in solrconfig.xml and simply reindex the data. That alone would be 
enough to let them upgrade to Solr 11 in the future without recreating the 
core. As discussed in the thread, I was able to get the associated Lucene PRs 
into main and Lucene 10, so we are good there.
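For reference, wiring the new merge policy into solrconfig.xml would look 
roughly like the following sketch. The factory class name comes from the PR 
title; the package and the bare (parameter-less) declaration are assumptions, 
not the exact configuration from #3883:

```xml
<!-- Sketch: register the merge policy from PR #3883 in solrconfig.xml.
     Only the class name LatestVersionMergePolicyFactory is from the PR
     title; the package path and lack of parameters are assumptions. -->
<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.LatestVersionMergePolicyFactory"/>
</indexConfig>
```

With a policy like this in place, newly written segments carry the current 
Lucene version and older segments are kept out of merges, which is what makes 
the reindex-in-place approach safe across upgrades.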

I am almost done with testing #3903, pending one integration issue when 
calling the REST endpoint in async mode (passing the async=request_id param). 
I expect to open it up for review by tonight. That said, it should not hold up 
the 10.0 release, since #3883 is in and is by itself sufficient to facilitate 
the upgrade. The UPGRADECOREINDEX Core Admin API (in #3903) removes some of 
the manual steps and performs the upgrade in a more optimized way.

Also, I am glad #3883 was merged into branch_9x as well (thanks David!), which 
essentially means any index originally created in Solr 8.x now has an upgrade 
path to 9.x and later without having to recreate the index from source.



was (Author: [email protected]):
[~ichattopadhyaya]  This is waiting for [#3903 
|https://github.com/apache/solr/pull/3903]to be merged.

This JIRA is split across 2 PRs:
[#3883 |https://github.com/apache/solr/pull/3883] => A new 
LatestVersionMergePolicyFactory to block older segments from participating in 
merges. (This is merged)
[#3903 |https://github.com/apache/solr/pull/3903] => Expose a 
/admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade

However even with just #3883, users should be able to configure the merge 
policy in solrconfig and simply reindex the data. That would be enough to 
enable them to upgrade to Solr 11 in the future without recreating the core. As 
discussed in the thread, I was able to get the associated Lucene PRs into main 
and Lucene 10, so we are good there.

I am almost done with testing #3903, pending one integration issue while 
calling the REST endpoint in async mode (passing async=request_id param). I 
expect to open it up for reviews by tonight. But that should not hold the 10.0 
release since #3883 is in, and by itself is sufficient to facilitate the 
upgrade. The UPGRADECOREINDEX Core Admin API (in #3903) helps remove some of 
the manual steps and facilitates the upgrade in a more optimized way.

Also, I am glad #3883 was also merged into branch_9x. Which essentially means 
any index originally created in Solr 8.x now has an upgrade path to 9x and 
later without having to recreate the index from source.


> Automatically upgrade Solr indexes without needing to reindex from source
> -
>
> Key: SOLR-17725
> URL: https://issues.apache.org/jira/browse/SOLR-17725
> Project: Solr
>  Issue Type: Improvement
>Reporter: Rahul Goswami
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0, 9.11
>
> Attachments: High Level Design.png
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Today upgrading from Solr version X to X+2 requires complete reingestion of 
> data from source. This comes from Lucene's constraint which only guarantees 
> index compatibility between the version the index was created in and the 
> immediate next version. 
> This reindexing usually comes with added downtime and/or cost. Especially in 
> case of deployments which are in customer environments and not completely in 
> control of the vendor, this proposition of having to completely reindex the 
> data can become a hard sell.
> I, on behalf of my employer, Commvault, have developed a way which achieves 
> this reindexing in-place on the same index. Also, the process automatically 
> keeps "upgrading" the indexes over multiple subsequent Solr upgrades without 
> needing manual intervention. 
> It comes with the following limitations:
> i) All _source_ fields need to be either stored=true or docValues=true. Any 
> copyField destination fields can be stored=false of course, just that the 
> source fields (or more precisely, the source fields you care about 
> preserving) should be either stored or docValues true. 
> ii) The datatype 

[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source

2025-12-30 Thread Rahul Goswami (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048422#comment-18048422
 ] 

Rahul Goswami edited comment on SOLR-17725 at 12/30/25 6:55 PM:


[~ichattopadhyaya]  This is waiting for [#3903 
|https://github.com/apache/solr/pull/3903]to be merged.

This JIRA is split across 2 PRs:
[#3883 |https://github.com/apache/solr/pull/3883] => A new 
LatestVersionMergePolicyFactory to block older segments from participating in 
merges. (This is merged)
[#3903 |https://github.com/apache/solr/pull/3903] => Expose a 
/admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade

However even with just #3883, users should be able to configure the merge 
policy in solrconfig and simply reindex the data. That would be enough to 
enable them to upgrade to Solr 11 in the future without recreating the core. As 
discussed in the thread, I was able to get the associated Lucene PRs into main 
and Lucene 10, so we are good there.

I am almost done with testing #3903, pending one integration issue while 
calling the REST endpoint in async mode (passing async=request_id param). I 
expect to open it up for reviews by tonight. But that should not hold the 10.0 
release since #3883 is in, and by itself is sufficient to facilitate the 
upgrade. The UPGRADECOREINDEX Core Admin API (in #3903) helps remove some of 
the manual steps and facilitates the upgrade in a more optimized way.

Also, I am glad #3883 was also merged into branch_9x. Which essentially means 
any index originally created in Solr 8.x now has an upgrade path to 9x and 
later without having to recreate the index from source.



was (Author: [email protected]):
[~ichattopadhyaya]  This is waiting for [#3903 
|https://github.com/apache/solr/pull/3903]to be merged.

This JIRA is split across 2 PRs:
[#3883 |https://github.com/apache/solr/pull/3883] => A new 
LatestVersionMergePolicyFactory to block older segments from participating in 
merges. (This is merged)
[#3903 |https://github.com/apache/solr/pull/3903] => Expose a 
/admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade

However even with just #3883, users should be able to configure the merge 
policy in solrconfig and simply reindex the data. That would be enough to 
enable them to upgrade to Solr 11 in the future without recreating the core. As 
discussed in the thread, I was able to get the associated Lucene PRs into main 
and Lucene 10, so we are good there.

I am almost done with testing #3903, pending one integration issue while 
calling the REST endpoint in async mode (passing async=request_id param). I 
expect to open it up for reviews by tonight. But that should not hold the 10x 
release since #3883 by itself is sufficient to facilitate the upgrade. The 
UPGRADECOREINDEX Core Admin API (in #3903) helps remove some of the manual 
steps and facilitates the upgrade in a more optimized way.

Also, I am glad #3883 was also merged into branch_9x. Which essentially means 
any index originally created in Solr 8.x now has an upgrade path to 9x and 
later without having to recreate the index from source.


> Automatically upgrade Solr indexes without needing to reindex from source
> -
>
> Key: SOLR-17725
> URL: https://issues.apache.org/jira/browse/SOLR-17725
> Project: Solr
>  Issue Type: Improvement
>Reporter: Rahul Goswami
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0, 9.11
>
> Attachments: High Level Design.png
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Today upgrading from Solr version X to X+2 requires complete reingestion of 
> data from source. This comes from Lucene's constraint which only guarantees 
> index compatibility between the version the index was created in and the 
> immediate next version. 
> This reindexing usually comes with added downtime and/or cost. Especially in 
> case of deployments which are in customer environments and not completely in 
> control of the vendor, this proposition of having to completely reindex the 
> data can become a hard sell.
> I, on behalf of my employer, Commvault, have developed a way which achieves 
> this reindexing in-place on the same index. Also, the process automatically 
> keeps "upgrading" the indexes over multiple subsequent Solr upgrades without 
> needing manual intervention. 
> It comes with the following limitations:
> i) All _source_ fields need to be either stored=true or docValues=true. Any 
> copyField destination fields can be stored=false of course, just that the 
> source fields (or more precisely, the source fields you care about 
> preserving) should be either stored or docValues true. 
> ii) The datatype of an existing field in sche

[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source

2025-12-30 Thread Rahul Goswami (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048422#comment-18048422
 ] 

Rahul Goswami edited comment on SOLR-17725 at 12/30/25 6:51 PM:


[~ichattopadhyaya]  This is waiting for [#3903 
|https://github.com/apache/solr/pull/3903]to be merged.

This JIRA is split across 2 PRs:
[#3883 |https://github.com/apache/solr/pull/3883] => A new 
LatestVersionMergePolicyFactory to block older segments from participating in 
merges. (This is merged)
[#3903 |https://github.com/apache/solr/pull/3903] => Expose a 
/admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade

However even with just #3883, users should be able to configure the merge 
policy in solrconfig and simply reindex the data. That would be enough to 
enable them to upgrade to Solr 11 in the future without recreating the core. As 
discussed in the thread, I was able to get the associated Lucene PRs into main 
and Lucene 10, so we are good there.

I am almost done with testing #3903, pending one integration issue while 
calling the REST endpoint in async mode (passing async=request_id param). I 
expect to open it up for reviews by tonight. But that should not hold the 10x 
release since #3883 by itself is sufficient to facilitate the upgrade. The 
UPGRADECOREINDEX Core Admin API (in #3903) helps remove some of the manual 
steps and facilitates the upgrade in a more optimized way.

Also, I am glad #3883 was also merged into branch_9x. Which essentially means 
any index originally created in Solr 8.x now has an upgrade path to 9x and 
later without having to recreate the index from source.



was (Author: [email protected]):
[~ichattopadhyaya]  This is waiting for [#3903 
|https://github.com/apache/solr/pull/3903]to be merged.

This JIRA is split across 2 PRs:
[#3883 |https://github.com/apache/solr/pull/3883] => A new 
LatestVersionMergePolicyFactory to block older segments from participating in 
merges. (This is merged)
[#3903 |https://github.com/apache/solr/pull/3903] => Expose a 
/admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade

However even with just #3883, users should be able to configure the merge 
policy in solrconfig and simply reindex the data. That would be enough to 
enable them to upgrade to Solr 11 in the future without recreating the core. As 
discussed in the thread, I was able to get the associated Lucene PRs into main 
and Lucene 10, so we are good there.

I am almost done with testing #3903, pending one integration issue while 
calling the REST endpoint in async mode (passing async=request_id param). I 
expect to open it up for reviews by tonight. But that should not hold the 10x 
release since #3883 by itself is sufficient to facilitate the upgrade. The 
UPGRADECOREINDEX Core Admin API (in #3903) helps remove some of the manual 
steps and facilitates the upgrade in a more optimized way.

Also, I am glad #3883 was also merged into branch_9x. Which essentially means, 
any index originally created in Solr 8.x now has an upgrade path to 9x and 
later without having to recreate the index from source.


> Automatically upgrade Solr indexes without needing to reindex from source
> -
>
> Key: SOLR-17725
> URL: https://issues.apache.org/jira/browse/SOLR-17725
> Project: Solr
>  Issue Type: Improvement
>Reporter: Rahul Goswami
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0, 9.11
>
> Attachments: High Level Design.png
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Today upgrading from Solr version X to X+2 requires complete reingestion of 
> data from source. This comes from Lucene's constraint which only guarantees 
> index compatibility between the version the index was created in and the 
> immediate next version. 
> This reindexing usually comes with added downtime and/or cost. Especially in 
> case of deployments which are in customer environments and not completely in 
> control of the vendor, this proposition of having to completely reindex the 
> data can become a hard sell.
> I, on behalf of my employer, Commvault, have developed a way which achieves 
> this reindexing in-place on the same index. Also, the process automatically 
> keeps "upgrading" the indexes over multiple subsequent Solr upgrades without 
> needing manual intervention. 
> It comes with the following limitations:
> i) All _source_ fields need to be either stored=true or docValues=true. Any 
> copyField destination fields can be stored=false of course, just that the 
> source fields (or more precisely, the source fields you care about 
> preserving) should be either stored or docValues true. 
> ii) The datatype of an existing field in schema.xml shou

[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source

2025-12-30 Thread Rahul Goswami (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048422#comment-18048422
 ] 

Rahul Goswami edited comment on SOLR-17725 at 12/30/25 6:51 PM:


[~ichattopadhyaya]  This is waiting for [#3903 
|https://github.com/apache/solr/pull/3903]to be merged.

This JIRA is split across 2 PRs:
[#3883 |https://github.com/apache/solr/pull/3883] => A new 
LatestVersionMergePolicyFactory to block older segments from participating in 
merges. (This is merged)
[#3903 |https://github.com/apache/solr/pull/3903] => Expose a 
/admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade

However even with just #3883, users should be able to configure the merge 
policy in solrconfig and simply reindex the data. That would be enough to 
enable them to upgrade to Solr 11 in the future without recreating the core. As 
discussed in the thread, I was able to get the associated Lucene PRs into main 
and Lucene 10, so we are good there.

I am almost done with testing #3903, pending one integration issue while 
calling the REST endpoint in async mode (passing async=request_id param). I 
expect to open it up for reviews by tonight. But that should not hold the 10x 
release since #3883 by itself is sufficient to facilitate the upgrade. The 
UPGRADECOREINDEX Core Admin API (in #3903) helps remove some of the manual 
steps and facilitates the upgrade in a more optimized way.

Also, I am glad #3883 was also merged into branch_9x. Which essentially means, 
any index originally created in Solr 8.x now has an upgrade path to 9x and 
later without having to recreate the index from source.



was (Author: [email protected]):
[~ichattopadhyaya]  This is waiting for [#3903 
|https://github.com/apache/solr/pull/3903]to be merged.

This JIRA is split across 2 PRs:
[#3883 |https://github.com/apache/solr/pull/3883] => A new 
LatestVersionMergePolicyFactory to block older segments from participating in 
merges. (This is merged)
[#3903 |https://github.com/apache/solr/pull/3903] => Expose a 
/admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade

However even with just #3883, users should be able to configure the merge 
policy in solrconfig and simply reindex the data. That would be enough to 
enable them to upgrade to Solr 11 in the future without recreating the core. As 
discussed in the thread, I was able to get the associated Lucene PRs into main 
and Lucene 10, so we are good there.

I am almost done with testing #3903, pending one integration issue while 
calling the REST endpoint in async mode (passing async=request_id param). I 
expect to open it up for reviews by tonight. But that should not hold the 10x 
release since #3883 by itself is sufficient to facilitate the upgrade. The 
UPGRADECOREINDEX Core Admin API (in #3903) helps remove some of the manual 
steps and facilitates the upgrade in a more optimized way.

I am glad, #3883 was also merged into branch_9x. Which essentially means, any 
index originally created in Solr 8.x now has an upgrade path to 9x and later 
without having to recreate the index from source.


> Automatically upgrade Solr indexes without needing to reindex from source
> -
>
> Key: SOLR-17725
> URL: https://issues.apache.org/jira/browse/SOLR-17725
> Project: Solr
>  Issue Type: Improvement
>Reporter: Rahul Goswami
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0, 9.11
>
> Attachments: High Level Design.png
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Today upgrading from Solr version X to X+2 requires complete reingestion of 
> data from source. This comes from Lucene's constraint which only guarantees 
> index compatibility between the version the index was created in and the 
> immediate next version. 
> This reindexing usually comes with added downtime and/or cost. Especially in 
> case of deployments which are in customer environments and not completely in 
> control of the vendor, this proposition of having to completely reindex the 
> data can become a hard sell.
> I, on behalf of my employer, Commvault, have developed a way which achieves 
> this reindexing in-place on the same index. Also, the process automatically 
> keeps "upgrading" the indexes over multiple subsequent Solr upgrades without 
> needing manual intervention. 
> It comes with the following limitations:
> i) All _source_ fields need to be either stored=true or docValues=true. Any 
> copyField destination fields can be stored=false of course, just that the 
> source fields (or more precisely, the source fields you care about 
> preserving) should be either stored or docValues true. 
> ii) The datatype of an existing field in schema.xml shouldn'

[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source

2025-12-30 Thread Rahul Goswami (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048422#comment-18048422
 ] 

Rahul Goswami edited comment on SOLR-17725 at 12/30/25 6:43 PM:


[~ichattopadhyaya]  This is waiting for [#3903 
|https://github.com/apache/solr/pull/3903]to be merged.

This JIRA is split across 2 PRs:
[#3883 |https://github.com/apache/solr/pull/3883] => A new 
LatestVersionMergePolicyFactory to block older segments from participating in 
merges. (This is merged)
[#3903 |https://github.com/apache/solr/pull/3903] => Expose a 
/admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade

However even with just #3883, users should be able to configure the merge 
policy in solrconfig and simply reindex the data. That would be enough to 
enable them to upgrade to Solr 11 in the future without recreating the core. As 
discussed in the thread, I was able to get the associated Lucene PRs into main 
and Lucene 10, so we are good there.

I am almost done with testing #3903, pending one integration issue while 
calling the REST endpoint in async mode (passing async=request_id param). I 
expect to open it up for reviews by tonight. But that should not hold the 10x 
release since #3883 by itself is sufficient to facilitate the upgrade. The 
UPGRADECOREINDEX Core Admin API (in #3903) helps remove some of the manual 
steps and facilitates the upgrade in a more optimized way.

I am glad, #3883 was also merged into branch_9x. Which essentially means, any 
index originally created in Solr 8.x now has an upgrade path to 9x and later 
without having to recreate the index from source.



was (Author: [email protected]):
[~ichattopadhyaya]  This is waiting for [#3903 
|https://github.com/apache/solr/pull/3903]to be merged.

This JIRA is split across 2 PRs:
[#3883 |https://github.com/apache/solr/pull/3883] => A new 
LatestVersionMergePolicyFactory to block older segments from participating in 
merges. (This is merged)
[#3903 |https://github.com/apache/solr/pull/3903] => Expose a 
/admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade

However even with just #3883, users should be able to configure the merge 
policy in solrconfig and simply reindex the data. That would be enough to 
enable them to upgrade to Solr 11 in the future without recreating the core. As 
discussed in the thread, I was able to get the associated Lucene PRs into main 
and Lucene 10, so we are good there.

I am almost done with testing #3903, pending one integration issue while 
calling the REST endpoint in async mode (passing async=request_id param). I 
expect to open it up for reviews by tonight. But that should not hold the 10x 
release since #3883 still provides a pathway to upgrade, although with a few 
more manual steps and in a slightly less optimized way to what the 
UPGRADECOREINDEX Core Admin API does (in #3903).

I am glad, #3883 was also merged into branch_9x. Which essentially means, any 
index originally created in Solr 8.x now has an upgrade path to 9x and later 
without having to recreate the index from source.


> Automatically upgrade Solr indexes without needing to reindex from source
> -
>
> Key: SOLR-17725
> URL: https://issues.apache.org/jira/browse/SOLR-17725
> Project: Solr
>  Issue Type: Improvement
>Reporter: Rahul Goswami
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0, 9.11
>
> Attachments: High Level Design.png
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Today upgrading from Solr version X to X+2 requires complete reingestion of 
> data from source. This comes from Lucene's constraint which only guarantees 
> index compatibility between the version the index was created in and the 
> immediate next version. 
> This reindexing usually comes with added downtime and/or cost. Especially in 
> case of deployments which are in customer environments and not completely in 
> control of the vendor, this proposition of having to completely reindex the 
> data can become a hard sell.
> I, on behalf of my employer, Commvault, have developed a way which achieves 
> this reindexing in-place on the same index. Also, the process automatically 
> keeps "upgrading" the indexes over multiple subsequent Solr upgrades without 
> needing manual intervention. 
> It comes with the following limitations:
> i) All _source_ fields need to be either stored=true or docValues=true. Any 
> copyField destination fields can be stored=false of course, just that the 
> source fields (or more precisely, the source fields you care about 
> preserving) should be either stored or docValues true. 
> ii) The datatype of an existing field in schema.xml shouldn't change upon 
> Solr

[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source

2025-12-30 Thread Rahul Goswami (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048422#comment-18048422
 ] 

Rahul Goswami edited comment on SOLR-17725 at 12/30/25 6:40 PM:


[~ichattopadhyaya]  This is waiting for [#3903 
|https://github.com/apache/solr/pull/3903]to be merged.

This JIRA is split across 2 PRs:
[#3883 |https://github.com/apache/solr/pull/3883] => A new 
LatestVersionMergePolicyFactory to block older segments from participating in 
merges. (This is merged)
[#3903 |https://github.com/apache/solr/pull/3903] => Expose a 
/admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade

However even with just #3883, users should be able to configure the merge 
policy in solrconfig and simply reindex the data. That would be enough to 
enable them to upgrade to Solr 11 in the future without recreating the core. As 
discussed in the thread, I was able to get the associated Lucene PRs into main 
and Lucene 10, so we are good there.

I am almost done with testing #3903, pending one integration issue while 
calling the REST endpoint in async mode (passing async=request_id param). I 
expect to open it up for reviews by tonight. But that should not hold the 10x 
release since #3883 still provides a pathway to upgrade, although with a few 
more manual steps and in a slightly less optimized way to what the 
UPGRADECOREINDEX Core Admin API does (in #3903).

I am glad, #3883 was also merged into branch_9x. Which essentially means, any 
index originally created in Solr 8.x now has an upgrade path to 9x and later 
without having to recreate the index from source.



was (Author: [email protected]):
[~ichattopadhyaya]  This is waiting for [#3903 
|https://github.com/apache/solr/pull/3903]to be merged.

This JIRA is split across 2 PRs:
[#3883 |https://github.com/apache/solr/pull/3883] => A new 
LatestVersionMergePolicyFactory to block older segments from participating in 
merges. (This is merged)
[#3903 |https://github.com/apache/solr/pull/3903] => Expose a 
/admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade

However even with just #3883, users should be able to configure the merge 
policy in solrconfig and simply reindex the data. That would be enough to 
enable them to upgrade to Solr 11 in the future without recreating the core. As 
discussed in the thread, I was able to get the associated Lucene PRs into main 
and Lucene 10, so we are good there.

I am almost done with testing #3903, pending one integration issue while 
calling the REST endpoint in async mode (passing async=request_id param). I 
expect to open it up for reviews by tonight. But that should not hold the 10x 
release since #3883 still provides a pathway to upgrade, with a few more manual 
steps and in a slightly less optimized way to what the UPGRADECOREINDEX Core 
Admin API does (in #3903).

I am glad, #3883 was also merged into branch_9x. Which essentially means, any 
index originally created in Solr 8.x now has an upgrade path to 9x and later 
without having to recreate the index from source.


> Automatically upgrade Solr indexes without needing to reindex from source
> -
>
> Key: SOLR-17725
> URL: https://issues.apache.org/jira/browse/SOLR-17725
> Project: Solr
>  Issue Type: Improvement
>Reporter: Rahul Goswami
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0, 9.11
>
> Attachments: High Level Design.png
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Today upgrading from Solr version X to X+2 requires complete reingestion of 
> data from source. This comes from Lucene's constraint which only guarantees 
> index compatibility between the version the index was created in and the 
> immediate next version. 
> This reindexing usually comes with added downtime and/or cost. Especially in 
> case of deployments which are in customer environments and not completely in 
> control of the vendor, this proposition of having to completely reindex the 
> data can become a hard sell.
> I, on behalf of my employer, Commvault, have developed a way which achieves 
> this reindexing in-place on the same index. Also, the process automatically 
> keeps "upgrading" the indexes over multiple subsequent Solr upgrades without 
> needing manual intervention. 
> It comes with the following limitations:
> i) All _source_ fields need to be either stored=true or docValues=true. Any 
> copyField destination fields can be stored=false of course, just that the 
> source fields (or more precisely, the source fields you care about 
> preserving) should be either stored or docValues true. 
> ii) The datatype of an existing field in schema.xml shouldn't change upon 
> Solr upgrade. Introducing new fields is fine. 
> For indexes where this limitation is not a problem (it wasn't for us!), the 
> tool can reindex in-place on the same core with zero downtime and 
> legitimately "upgrade" the index. This can remove a lot of operational 
> headaches, especially in environments with hundreds/thousands of very large 
> indexes.

[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source

2025-12-30 Thread Rahul Goswami (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048422#comment-18048422
 ] 

Rahul Goswami edited comment on SOLR-17725 at 12/30/25 6:39 PM:


[~ichattopadhyaya] This is waiting for 
[#3903|https://github.com/apache/solr/pull/3903] to be merged.

This JIRA is split across 2 PRs:
[#3883|https://github.com/apache/solr/pull/3883] => A new 
LatestVersionMergePolicyFactory to block older segments from participating in 
merges. (This is merged)
[#3903|https://github.com/apache/solr/pull/3903] => Expose a 
/admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade

However, even with just #3883, users should be able to configure the merge 
policy in solrconfig and simply reindex the data. That would be enough to 
enable them to upgrade to Solr 11 in the future without recreating the core. As 
discussed in the thread, I was able to get the associated Lucene PRs into main 
and Lucene 10, so we are good there.

I am almost done with testing #3903, pending one integration issue while 
calling the REST endpoint in async mode (passing the async=request_id param). 
I expect to open it up for review by tonight. But that should not hold up the 
10.x release, since #3883 still provides a pathway to upgrade, with a few more 
manual steps and in a slightly less optimized way than what the 
UPGRADECOREINDEX Core Admin API (#3903) does.

I am glad #3883 was also merged into branch_9x, which essentially means any 
index originally created in Solr 8.x now has an upgrade path to 9.x and later 
without having to recreate the index from source.





[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source

2025-11-20 Thread Rahul Goswami (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18039569#comment-18039569
 ] 

Rahul Goswami edited comment on SOLR-17725 at 11/20/25 11:38 PM:
-

https://github.com/apache/lucene/pull/14607 is merged, so now we should be able 
to open an index irrespective of which version it was created in, as long as 
all segments are either LATEST or LATEST-1 version. This is the part that will 
be achieved through this JIRA. Submitted a PR for a custom merge policy to kick 
off code contribution on this effort. Thanks [~dsmiley] and [~magibney] for the 
pointers!
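As context for why the created-version change in that Lucene PR matters: Lucene already ships an offline IndexUpgrader tool that can rewrite every segment of a cold index to the latest format, but it leaves the index's created-version untouched, which is exactly the gap being closed here. A sketch of its invocation (jar names and the index path below are placeholders):

```shell
# Lucene's stock offline upgrader; run only against a stopped core.
# Jar versions and the index path are placeholders, not real values.
LUCENE_CORE_JAR="lucene-core-10.0.0.jar"
LUCENE_BWC_JAR="lucene-backward-codecs-10.0.0.jar"
INDEX_DIR="/var/solr/data/core1/data/index"

UPGRADE_CMD="java -cp ${LUCENE_CORE_JAR}:${LUCENE_BWC_JAR} org.apache.lucene.index.IndexUpgrader -delete-prior-commits ${INDEX_DIR}"
# ${UPGRADE_CMD}   # uncomment to actually run it
echo "${UPGRADE_CMD}"
```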





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]




[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source

2025-09-12 Thread Rahul Goswami (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941704#comment-17941704
 ] 

Rahul Goswami edited comment on SOLR-17725 at 9/12/25 5:07 AM:
---

[~janhoy]  Thanks for taking the time to review the JIRA. Please find my 
thoughts on your questions below:

 
1) Do you intend for this to be a new Solr API, if so what is the proposed API? 
or a CLI utility tool to run on a cold index folder?
> The implementation needs to run on a hot index for it to be lossless. 
> Indexing calls happen using Solr APIs, so Solr will need to be running. In 
> our custom implementation I have hooked the process into SolrDispatchFilter 
> load() so that the process can start upon server start for the least 
> operational overhead. As a generic solution, I am thinking we can expose it 
> as an action (/solr/admin/cores?action=UPGRADEINDEXES) with an "async" 
> option for trackability. This way users can hook the command into their 
> shell/cmd scripts after Solr starts. Open to suggestions here. 
 
2) Is one of your design goals to avoid the need for 2-3x disk space during the 
reindex, since you work on segment level and do merges?
> Reducing infrastructure costs is a major design goal here, as is removing 
> the operational overhead of an index upgrade during a Solr upgrade where 
> possible. The fact that the design avoids the need for 2x disk space is a 
> definite major advantage.
 
3) Requiring a Lucene API change is a potential blocker, I'd not be surprised if 
the Lucene project rejects making the "created-version" property writable, so 
such a discussion with them would come early
> I agree. I am hopeful(!!) this will not be rejected though since they can 
> implement guardrails around changing the "created-version" property for added 
> security. In my implementation I added the change in Lucene IndexWriter to 
> check for all the segments in a commit and ensure they are the new version in 
> every aspect before setting the created-version property. This already 
> happens in a synchronized block upon commit, so in my (limited) opinion, it 
> should be safe. The API they give us can do all required internal validations 
> and fail gracefully without any harm to the index. I can get a discussion 
> started with the Lucene folks once we agree on the basics of this 
> implementation. Or do you suggest I do that right away?
 
4) Obviously a new Solr API needs to play well with SolrCloud as well as other 
features such as shard split / move etc. Have you thought about locking / 
conflicts?
> SolrCloud challenges are not factored into the current implementation. But 
> given the process works at the core level, agnostic of the mode, I am 
> optimistic we can adapt the solution for SolrCloud through PR discussions. 
> We might have to block certain operations like splitshard while this 
> process is underway on a collection. 
 
5) A reindex-collection API is probably wanted, however it could be acceptable 
to implement a "core-level" API first and later add a "collection-level" API on 
top of it
> Agreed
 
6) Challenge the assumption that "in-place" segment level is the best choice 
for this feature. Re-indexing into a new collection due to major schema changes 
is also a common use case that this will not address
> I would refer back to my answer to your second question in defense of the 
> "in-place" implementation. Segment-level processing gives us the ability to 
> restrict pollution of the index due to merges as we reindex, and also gives 
> us restartability. 
Agreed, this is not a substitute for when a field's data type changes. It is 
intended as a substitute for a full index upgrade when you upgrade Solr, so as 
to overcome the X --> X+1 --> X+2 version upgrade path limitation which exists 
today even when there are no schema changes. Of course, users are free to add 
new fields and should still be able to use this utility.



[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source

2025-09-11 Thread Rahul Goswami (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941704#comment-17941704
 ] 

Rahul Goswami edited comment on SOLR-17725 at 9/12/25 5:10 AM:
---

[~janhoy]  Thanks for taking the time to review the JIRA. Please find my 
thoughts on your questions below:

 
1) Do you intend for this to be a new Solr API, if so what is the proposed API? 
or a CLI utility tool to run on a cold index folder?
> The implementation needs to run on a hot index for it to be lossless. 
> Indexing calls happen using Solr APIs so Solr will need to be running. In our 
> custom implementation I have hooked the process into SolrDispatchFilter 
> load() so that the process can start upon server start for least operational 
> overhead. As a generic solution I am thinking we can expose it as an action 
> (/solr/admin/cores?action=UPGRADEINDEXES) with an "async" option for 
> trackability. This way users can hook up the command into their shell/cmd 
> scripts after Solr starts. Open to suggestions here.
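To make that concrete, here is a minimal sketch (plain Java; the endpoint shape and the `async` request-id parameter follow the proposal above but are assumptions, not an agreed API) of building the URL a post-startup shell/cmd script would hit:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds the proposed core-admin call, e.g. for a post-startup script:
//   curl "http://localhost:8983/solr/admin/cores?action=UPGRADEINDEXES&async=upgrade-1"
public class UpgradeIndexesRequest {
    public static URI build(String solrBase, String asyncId) {
        // The async id lets the script poll for status later (trackability).
        String query = "action=UPGRADEINDEXES&async="
            + URLEncoder.encode(asyncId, StandardCharsets.UTF_8);
        return URI.create(solrBase + "/admin/cores?" + query);
    }
}
```

The async id here mirrors how other long-running core-admin actions are tracked; nothing about the parameter name is final.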
 
2) Is one of your design goals to avoid the need for 2-3x disk space during the 
reindex, since you work on segment level and do merges?
> Reducing infrastructure costs is a major design goal here. Also removing the 
> operational overhead of index upgrade during Solr upgrade when possible. The 
> fact that the design avoids the need for 2x disk space is a definite major 
> advantage.
 
3) Requiring Lucene API change is a potential blocker, I'd not be surprised if 
the Lucene project rejects making the "created-version" property writable, so 
such a discussion with them would come early
> I agree. I am hopeful(!!) this will not be rejected though since they can 
> implement guardrails around changing the "created-version" property for added 
> security. In my implementation I added the change in Lucene IndexWriter to 
> check for all the segments in a commit and ensure they are the new version in 
> every aspect before setting the created-version property. This already 
> happens in a synchronized block upon commit, so in my opinion, it should be 
> safe. The API they give us can do all required internal validations and fail 
> gracefully without any harm to the index. I can get a discussion started on 
> the Lucene dev list once we agree on the basics of this implementation. Or do 
> you suggest I do that right away?
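That guardrail can be sketched as a toy model (plain Java, deliberately not the real Lucene IndexWriter/SegmentInfos API; SegmentState and the method names are hypothetical stand-ins): the created-version stamp only moves when every segment in the commit is already on the target version, and the call fails gracefully otherwise.

```java
import java.util.List;

// Toy model of the proposed guardrail: only stamp the new created-version
// if every segment in the commit point is already on the target version.
public class CreatedVersionGuard {
    // Hypothetical per-segment record; real Lucene tracks this in its segment metadata.
    public record SegmentState(String name, int majorVersion) {}

    private int createdVersionMajor;

    public CreatedVersionGuard(int createdVersionMajor) {
        this.createdVersionMajor = createdVersionMajor;
    }

    // Returns true only if the stamp was updated; otherwise leaves it untouched.
    public synchronized boolean tryUpgradeCreatedVersion(List<SegmentState> commit, int targetVersion) {
        boolean allUpgraded = commit.stream().allMatch(s -> s.majorVersion() == targetVersion);
        if (!allUpgraded) {
            return false; // fail gracefully, index metadata unchanged
        }
        createdVersionMajor = targetVersion;
        return true;
    }

    public synchronized int getCreatedVersionMajor() {
        return createdVersionMajor;
    }
}
```

The synchronized methods mirror the point above that the real check would piggyback on the synchronization already done at commit time.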
 
4) Obviously a new Solr API needs to play well with SolrCloud as well as other 
features such as shard split / move etc. Have you thought about locking / 
conflicts?
> SolrCloud challenges are not factored into the current implementation. But 
> given the process works at Core level and agnostic of the mode, I am 
> optimistic we can adapt the solution for SolrCloud through PR discussions.
We might have to block certain operations like splitshard while this process is 
underway on a collection. 
 
5) A reindex-collection API is probably wanted, however it could be acceptable 
to implement a "core-level" API first and later add a "collection-level" API on 
top of it
> Agreed
 
6) Challenge the assumption that "in-place" segment level is the best choice 
for this feature. Re-indexing into a new collection due to major schema changes 
is also a common use case that this will not address
> I would revert to my answer to your second question in defense of the 
> "in-place" implementation. Segment level processing gives us the ability to 
> restrict pollution of index due to merges as we reindex and also 
> restartability. 
Agreed this is not a substitute for when a field data type changes. This is 
intended to be a substitute for index upgrade when you upgrade Solr so as to 
overcome the X --> X+1 --> X+2 version upgrade path limitation which exists 
today despite no schema changes. Of course, users are free to add new fields 
and should still be able to use this utility.


was (Author: [email protected]):
[~janhoy]  Thanks for taking the time to review the JIRA. Please find my 
thoughts on your questions below:

 
1) Do you intend for this to be a new Solr API, if so what is the proposed API? 
or a CLI utility tool to run on a cold index folder?
> The implementation needs to run on a hot index for it to be lossless. 
> Indexing calls happen using Solr APIs so Solr will need to be running. In our 
> custom implementation I have hooked the process into SolrDispatchFilter 
> load() so that the process can start upon server start for least operational 
> overhead. As a generic solution I am thinking we can expose it as an action 
> (/solr/admin/cores?action=UPGRADEINDEXES) with an "async" option for 
> trackability. This way users can hook up the command into their shell/cmd 
> scripts after Solr starts. Open to suggestions here.
 
2) Is one of your design goals to avoid the need for 2-3x disk space during the 
reindex, since you work on segment level and do merges?

[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source

2025-09-11 Thread Rahul Goswami (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941704#comment-17941704
 ] 

Rahul Goswami edited comment on SOLR-17725 at 9/12/25 5:09 AM:
---

[~janhoy]  Thanks for taking the time to review the JIRA. Please find my 
thoughts on your questions below:

 
1) Do you intend for this to be a new Solr API, if so what is the proposed API? 
or a CLI utility tool to run on a cold index folder?
> The implementation needs to run on a hot index for it to be lossless. 
> Indexing calls happen using Solr APIs so Solr will need to be running. In our 
> custom implementation I have hooked the process into SolrDispatchFilter 
> load() so that the process can start upon server start for least operational 
> overhead. As a generic solution I am thinking we can expose it as an action 
> (/solr/admin/cores?action=UPGRADEINDEXES) with an "async" option for 
> trackability. This way users can hook up the command into their shell/cmd 
> scripts after Solr starts. Open to suggestions here.
 
2) Is one of your design goals to avoid the need for 2-3x disk space during the 
reindex, since you work on segment level and do merges?
> Reducing infrastructure costs is a major design goal here. Also removing the 
> operational overhead of index upgrade during Solr upgrade when possible. The 
> fact that the design avoids the need for 2x disk space is a definite major 
> advantage.
 
3) Requiring Lucene API change is a potential blocker, I'd not be surprised if 
the Lucene project rejects making the "created-version" property writable, so 
such a discussion with them would come early
> I agree. I am hopeful(!!) this will not be rejected though since they can 
> implement guardrails around changing the "created-version" property for added 
> security. In my implementation I added the change in Lucene IndexWriter to 
> check for all the segments in a commit and ensure they are the new version in 
> every aspect before setting the created-version property. This already 
> happens in a synchronized block upon commit, so in my opinion, it should be 
> safe. The API they give us can do all required internal validations and fail 
> gracefully without any harm to the index. I can get a discussion started with 
> the Lucene folks once we agree on the basics of this implementation. Or do 
> you suggest I do that right away?
 
4) Obviously a new Solr API needs to play well with SolrCloud as well as other 
features such as shard split / move etc. Have you thought about locking / 
conflicts?
> SolrCloud challenges are not factored into the current implementation. But 
> given the process works at Core level and agnostic of the mode, I am 
> optimistic we can adapt the solution for SolrCloud through PR discussions.
We might have to block certain operations like splitshard while this process is 
underway on a collection. 
 
5) A reindex-collection API is probably wanted, however it could be acceptable 
to implement a "core-level" API first and later add a "collection-level" API on 
top of it
> Agreed
 
6) Challenge the assumption that "in-place" segment level is the best choice 
for this feature. Re-indexing into a new collection due to major schema changes 
is also a common use case that this will not address
> I would revert to my answer to your second question in defense of the 
> "in-place" implementation. Segment level processing gives us the ability to 
> restrict pollution of index due to merges as we reindex and also 
> restartability. 
Agreed this is not a substitute for when a field data type changes. This is 
intended to be a substitute for index upgrade when you upgrade Solr so as to 
overcome the X --> X+1 --> X+2 version upgrade path limitation which exists 
today despite no schema changes. Of course, users are free to add new fields 
and should still be able to use this utility.


was (Author: [email protected]):
[~janhoy]  Thanks for taking the time to review the JIRA. Please find my 
thoughts on your questions below:

 
1) Do you intend for this to be a new Solr API, if so what is the proposed API? 
or a CLI utility tool to run on a cold index folder?
> The implementation needs to run on a hot index for it to be lossless. 
> Indexing calls happen using Solr APIs so Solr will need to be running. In our 
> custom implementation I have hooked the process into SolrDispatchFilter 
> load() so that the process can start upon server start for least operational 
> overhead. As a generic solution I am thinking we can expose it as an action 
> (/solr/admin/cores?action=UPGRADEINDEXES) with an "async" option for 
> trackability. This way users can hook up the command into their shell/cmd 
> scripts after Solr starts. Open to suggestions here.
 
2) Is one of your design goals to avoid the need for 2-3x disk space during the 
reindex, since you work on segment level and do merges?

[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source

2025-04-24 Thread Rahul Goswami (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17944396#comment-17944396
 ] 

Rahul Goswami edited comment on SOLR-17725 at 4/25/25 5:36 AM:
---

Will do [~dsmiley] Thanks.

 

[~gus] As far as I can see, the current implementation doesn't run the risk of 
corruption. The status is maintained in two ways:

1) At the core level -> to keep track of which core was being processed when 
the service went down/killed. A file autoupgrade_status.csv is maintained which 
is written each time a core is picked up for processing and a status is set for 
the same. Each time the process resumes it picks up the core with status 
"REINDEXING_ACTIVE" if any. For SolrCloud, this file can be housed in 
Zookeeper. This is an implementation detail I am happy to discuss further, but 
in our (Commvault's) implementation we recognize the following statuses:

            DEFAULT,
            REINDEXING_ACTIVE,
            REINDEXING_PAUSED,
            PROCESSED,
            ERROR,
            CORRECTVERSION
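A minimal sketch of this core-level bookkeeping (plain Java with the statuses listed above; the CSV line format and the helper names are assumptions for illustration, not Commvault's actual implementation): on restart, the core still marked REINDEXING_ACTIVE, if any, is picked up again.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class AutoUpgradeStatus {
    // Statuses recognized per core, as listed above.
    public enum Status { DEFAULT, REINDEXING_ACTIVE, REINDEXING_PAUSED, PROCESSED, ERROR, CORRECTVERSION }

    // Parse "coreName,STATUS" lines as they might appear in autoupgrade_status.csv.
    public static Map<String, Status> parse(String csv) {
        Map<String, Status> statuses = new LinkedHashMap<>();
        for (String line : csv.split("\\R")) {
            if (line.isBlank()) continue;
            String[] parts = line.split(",", 2);
            statuses.put(parts[0].trim(), Status.valueOf(parts[1].trim()));
        }
        return statuses;
    }

    // On restart, resume the core that was mid-flight when the service died.
    public static Optional<String> findResumeCandidate(Map<String, Status> statuses) {
        return statuses.entrySet().stream()
            .filter(e -> e.getValue() == Status.REINDEXING_ACTIVE)
            .map(Map.Entry::getKey)
            .findFirst();
    }
}
```

For SolrCloud the same file contents could live in a Zookeeper node instead of on disk, as mentioned above.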

 

2) At the segment level -> This is where we piggyback on Lucene's design and 
it's beautiful! As we iterate over each segment, we read the live docs out of  
the segment, create a SolrInputDocument out of it and reindex using Solr's API. 
This helps achieve two things:

i) A reindexed doc helps mark an existing (old) doc as deleted (when 
auto-commit kicks in). This way if the service goes down, we don't need to 
process the already processed docs of the segment. And if the service goes down 
before a commit could be processed, the small penalty is reprocessing the docs 
of only that segment. 

ii) When a segment is fully processed, Lucene's DeletionPolicy deletes it 
reclaiming space in the process. Hence we never process the same segment again.

Note that as we do this, we are in no way interfering with Lucene's index 
structure directly and only interacting by means of APIs.

 

A combination of these factors helps maintain continuity in the processing of a 
core despite failures, without running the risk of corruption.
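The restart behavior described above can be illustrated with a toy in-memory model (plain Java; it only mimics the commit/deletion-policy semantics described, it does not touch real Lucene APIs): reindexed docs are marked deleted in their old segment at commit, a fully processed segment drops away, and a crash before a commit costs only the uncommitted docs of the in-flight segment.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the segment-level resume behavior: a reindexed doc is marked
// deleted in its old segment at commit time, and a segment with no live docs
// left is dropped (mimicking what Lucene's deletion policy achieves for real).
public class SegmentUpgradeModel {
    private final Map<String, Set<String>> liveDocsBySegment = new LinkedHashMap<>();
    private final ArrayDeque<String[]> uncommitted = new ArrayDeque<>(); // {segment, docId}

    public void addSegment(String name, Set<String> docIds) {
        liveDocsBySegment.put(name, new LinkedHashSet<>(docIds));
    }

    // Reindex one doc; the delete of the old copy only lands at the next commit.
    public void reindex(String segment, String docId) {
        uncommitted.add(new String[] {segment, docId});
    }

    // Auto-commit: apply pending deletes, drop fully processed segments.
    public void commit() {
        while (!uncommitted.isEmpty()) {
            String[] op = uncommitted.poll();
            Set<String> live = liveDocsBySegment.get(op[0]);
            if (live != null) {
                live.remove(op[1]);
                if (live.isEmpty()) liveDocsBySegment.remove(op[0]); // space reclaimed
            }
        }
    }

    // Crash before a commit: only the uncommitted work is lost.
    public void crash() {
        uncommitted.clear();
    }

    public Set<String> remainingDocs(String segment) {
        return liveDocsBySegment.getOrDefault(segment, Set.of());
    }

    public boolean segmentExists(String segment) {
        return liveDocsBySegment.containsKey(segment);
    }
}
```

In the real process the reindex step would read stored/docValues fields into a SolrInputDocument and go through Solr's update API; the model above only captures the bookkeeping that makes the process restartable.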


was (Author: [email protected]):
Will do [~dsmiley] Thanks.

 

[~gus] As far as I can see, the current implementation doesn't run the risk of 
corruption. The status is maintained in two ways:

1) At the core level -> to keep track of which core was being processed when 
the service went down/killed. A file autoupgrade_status.csv is maintained which 
is written each time a core is picked up for processing and a status is set for 
the same. Each time the process resumes it picks up the core with status 
"REINDEXING_ACTIVE" if any. For SolrCloud, this file can be housed in 
Zookeeper. This is an implementation detail I am happy to discuss further, but 
in our (Commvault's) implementation we recognize the following statuses:

            DEFAULT,
            REINDEXING_ACTIVE,
            REINDEXING_PAUSED,
            PROCESSED,
            ERROR,
            CORRECTVERSION

 

2) At the segment level -> This is where we piggyback on Lucene's design and 
it's beautiful! As we iterate over each segment, we read the live docs out of  
the segment, create a SolrInputDocument out of it and reindex using Solr's API. 
This helps achieve two things:

i) A reindexed doc helps mark an existing (old) doc as deleted (when 
auto-commit kicks in). This way if the service goes down, we don't need to 
process the already processed docs of the segment. And if the service goes down 
before a commit could be processed, the small penalty is reprocessing the docs 
of only that segment. 

ii) When a segment is fully processed, Lucene's DeletionPolicy deletes it 
reclaiming space in the process. Hence we never process the same segment again.

Note that as we do this, we are in no way interfering with Lucene's index 
structure directly and only interacting by means of APIs.

 

A combination of these factors helps maintain continuity in the processing of a 
core despite failures, without running the risk of corruption.

 

 

> Automatically upgrade Solr indexes without needing to reindex from source
> -
>
> Key: SOLR-17725
> URL: https://issues.apache.org/jira/browse/SOLR-17725
> Project: Solr
>  Issue Type: Improvement
>Reporter: Rahul Goswami
>Priority: Major
> Attachments: High Level Design.png
>
>
> Today upgrading from Solr version X to X+2 requires complete reingestion of 
> data from source. This comes from Lucene's constraint which only guarantees 
> index compatibility between the version the index was created in and the 
> immediate next version. 
> This reindexing usually comes with added downtime and/or cost. Especially in 
> case of deployments which are in customer environments and not completely 

[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source

2025-04-18 Thread Rahul Goswami (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941704#comment-17941704
 ] 

Rahul Goswami edited comment on SOLR-17725 at 4/19/25 12:03 AM:


[~janhoy]  Thanks for taking the time to review the JIRA. Please find my 
thoughts on your questions below:

 
1) Do you intend for this to be a new Solr API, if so what is the proposed API? 
or a CLI utility tool to run on a cold index folder?
> The implementation needs to run on a hot index for it to be lossless. 
> Indexing calls happen using Solr APIs so Solr will need to be running. In our 
> custom implementation I have hooked the process into SolrDispatchFilter 
> load() so that the process can start upon server start for least operational 
> overhead. As a generic solution I am thinking we can expose it as an action 
> (/solr/admin/cores?action=UPGRADEINDEXES) with an "async" option for 
> trackability. This way users can hook up the command into their shell/cmd 
> scripts after Solr starts. Open to suggestions here.
 
2) Is one of your design goals to avoid the need for 2-3x disk space during the 
reindex, since you work on segment level and do merges?
> Reducing infrastructure costs is a major design goal here. Also removing the 
> operational overhead of index upgrade during Solr upgrade when possible. 
 
3) Requiring Lucene API change is a potential blocker, I'd not be surprised if 
the Lucene project rejects making the "created-version" property writable, so 
such a discussion with them would come early
> I agree. I am hopeful(!!) this will not be rejected though since they can 
> implement guardrails around changing the "created-version" property for added 
> security. In my implementation I added the change in Lucene IndexWriter to 
> check for all the segments in a commit and ensure they are the new version in 
> every aspect before setting the created-version property. This already 
> happens in a synchronized block upon commit, so in my (limited) opinion, it 
> should be safe. The API they give us can do all required internal validations 
> and fail gracefully without any harm to the index. I can get a discussion 
> started with the Lucene folks once we agree on the basics of this 
> implementation. Or do you suggest I do that right away?
 
4) Obviously a new Solr API needs to play well with SolrCloud as well as other 
features such as shard split / move etc. Have you thought about locking / 
conflicts?
> SolrCloud challenges are not factored into the current implementation. But 
> given the process works at Core level and agnostic of the mode, I am 
> optimistic we can adapt the solution for SolrCloud through PR discussions.
We might have to block certain operations like splitshard while this process is 
underway on a collection. 
 
5) A reindex-collection API is probably wanted, however it could be acceptable 
to implement a "core-level" API first and later add a "collection-level" API on 
top of it
> Agreed
 
6) Challenge the assumption that "in-place" segment level is the best choice 
for this feature. Re-indexing into a new collection due to major schema changes 
is also a common use case that this will not address
> I would revert to my answer to your second question in defense of the 
> "in-place" implementation. Segment level processing gives us the ability to 
> restrict pollution of index due to merges as we reindex and also 
> restartability. 
Agreed this is not a substitute for when a field data type changes. This is 
intended to be a substitute for index upgrade when you upgrade Solr so as to 
overcome the X --> X+1 --> X+2 version upgrade path limitation which exists 
today despite no schema changes. Of course, users are free to add new fields 
and should still be able to use this utility.


was (Author: [email protected]):
[~janhoy]  Thanks for taking the time to review the JIRA. Please find my 
thoughts on your questions below:

 
1) Do you intend for this to be a new Solr API, if so what is the proposed API? 
or a CLI utility tool to run on a cold index folder?
> The implementation needs to run on a hot index for it to be lossless. 
> Indexing calls happen using Solr APIs so Solr will need to be running. In our 
> custom implementation I have hooked the process into SolrDispatchFilter 
> load() so that the process can start upon server start for least operational 
> overhead. As a generic solution I am thinking we can expose it as an action 
> (/solr/admin/cores?action=UPGRADEINDEXES) with an "async" option for 
> trackability. This way users can hook up the command into their shell/cmd 
> scripts after Solr starts. Open to suggestions here.
 
2) Is one of your design goals to avoid the need for 2-3x disk space during the 
reindex, since you work on segment level and do merges?
> Reducing infrastructure costs is a major design goal here. Also removing the 
> o

[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source

2025-04-18 Thread Rahul Goswami (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17940243#comment-17940243
 ] 

Rahul Goswami edited comment on SOLR-17725 at 4/19/25 12:00 AM:


 

Attached document outlines an example where the upgrade tool works on an index 
originally created in Solr 7.x, AFTER an upgrade to Solr 8.x. 

 

Key points:

1) Lucene version X can read index created in version X-1. Writing of new 
segments happens with the latest version codec.

2) When a segment merge happens, the segment maintains a version stamp 
"minVersion", which is the lowest version among the segments participating in the merge.

3) The segments_* file in a Lucene index maintains the Lucene version where the 
index was first created.

 

The design doc outlines the process of converting all segments to the new 
version. It's sort of a pull model where you first upgrade and then "pull" the 
index to the current version.

By the end of the process outlined in the doc, all segments get converted to 
the new version and the index in all respects is an "upgraded" index. The only 
missing piece is to update the index creation version in the commit point. I 
did this by exposing a method in Lucene's IndexWriter which validates the 
version of all segments and updates the creation version stamp in the commit 
point (we might need to request an API from Lucene here). When this index is 
opened in Solr 9.x, it can read this index (thanks to point #1) and the same 
process repeats to make the index ready for Solr 10.x. 


was (Author: [email protected]):
 

Attached document outlines an example where the upgrade tool works on an index 
originally created in Solr 7.x, AFTER an upgrade to Solr 8.x. 

 

Key points:

1) Lucene version X can read index created in version X-1. Writing of new 
segments happens with the latest version codec.

2) When a segment merge happens, the segment maintains a version stamp 
"minVersion", which is the lowest version among the segments participating in the merge.

3) The segments_* file in a Lucene index maintains the Lucene version where the 
index was first created.

 

The design doc outlines the process of converting all segments to the new 
version. It's sort of a pull model where you first upgrade and then "pull" the 
index to the current version.

By the end of the process outlined in the doc, all segments get converted to 
the new version and the index in all respects is an "upgraded" index. The only 
missing piece is to update the index creation version in the commit point. I 
did this by exposing a method in Lucene's CommitInfos which validates the 
version of all segments and updates the creation version stamp in the commit 
point (we might need to request an API from Lucene here). When this index is 
opened in Solr 9.x, it can read this index (thanks to point #1) and the same 
process repeats to make the index ready for Solr 10.x. 

> Automatically upgrade Solr indexes without needing to reindex from source
> -
>
> Key: SOLR-17725
> URL: https://issues.apache.org/jira/browse/SOLR-17725
> Project: Solr
>  Issue Type: Improvement
>Reporter: Rahul Goswami
>Priority: Major
> Attachments: High Level Design.png
>
>
> Today upgrading from Solr version X to X+2 requires complete reingestion of 
> data from source. This comes from Lucene's constraint which only guarantees 
> index compatibility between the version the index was created in and the 
> immediate next version. 
> This reindexing usually comes with added downtime and/or cost. Especially in 
> case of deployments which are in customer environments and not completely in 
> control of the vendor, this proposition of having to completely reindex the 
> data can become a hard sell.
> I, on behalf of my employer, Commvault, have developed a way which achieves 
> this reindexing in-place on the same index. Also, the process automatically 
> keeps "upgrading" the indexes over multiple subsequent Solr upgrades without 
> needing manual intervention. 
> It comes with the following limitations:
> i) All _source_ fields need to be either stored=true or docValues=true. Any 
> copyField destination fields can be stored=false of course, just that the 
> source fields (or more precisely, the source fields you care about 
> preserving) should be either stored or docValues true. 
> ii) The datatype of an existing field in schema.xml shouldn't change upon 
> Solr upgrade. Introducing new fields is fine. 
> For indexes where this limitation is not a problem (it wasn't for us!), the 
> tool can reindex in-place on the same core with zero downtime and 
> legitimately "upgrade" the index. This can remove a lot of operational 
> headaches, especially in environments with hundreds/thousands

[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source

2025-04-14 Thread Rahul Goswami (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17944396#comment-17944396
 ] 

Rahul Goswami edited comment on SOLR-17725 at 4/14/25 3:44 PM:
---

Will do [~dsmiley] Thanks.

 

[~gus] As far as I can see, the current implementation doesn't run the risk of 
corruption. The status is maintained in two ways:

1) At the core level -> to keep track of which core was being processed when 
the service went down/killed. A file autoupgrade_status.csv is maintained which 
is written each time a core is picked up for processing and a status is set for 
the same. Each time the process resumes it picks up the core with status 
"REINDEXING_ACTIVE" if any. For SolrCloud, this file can be housed in 
Zookeeper. This is an implementation detail I am happy to discuss further, but 
in our (Commvault's) implementation we recognize the following statuses:

            DEFAULT,
            REINDEXING_ACTIVE,
            REINDEXING_PAUSED,
            PROCESSED,
            ERROR,
            CORRECTVERSION

 

2) At the segment level -> This is where we piggyback on Lucene's design and 
it's beautiful! As we iterate over each segment, we read the live docs out of  
the segment, create a SolrInputDocument out of it and reindex using Solr's API. 
This helps achieve two things:

i) A reindexed doc helps mark an existing (old) doc as deleted (when 
auto-commit kicks in). This way if the service goes down, we don't need to 
process the already processed docs of the segment. And if the service goes down 
before a commit could be processed, the small penalty is reprocessing the docs 
of only that segment. 

ii) When a segment is fully processed, Lucene's DeletionPolicy deletes it 
reclaiming space in the process. Hence we never process the same segment again.

Note that as we do this, we are in no way interfering with Lucene's index 
structure directly and only interacting by means of APIs.

 

A combination of these factors helps maintain continuity in the processing of a 
core despite failures, without running the risk of corruption.

 

 


was (Author: [email protected]):
Will do [~dsmiley] Thanks.

 

[~gus] As far as I can see, the current implementation doesn't run the risk of 
corruption. The status is maintained in two ways:

1) At the core level -> to keep track of which core was being processed when 
the service went down/killed. A file autoupgrade_status.csv is maintained which 
is written each time a core is picked up for processing and a status is set for 
the same. Each time the process resumes it picks up the core with status 
"REINDEXING_ACTIVE" if any. For SolrCloud, this file can be housed in 
Zookeeper. This is an implementation detail I am happy to discuss further, but 
in our (Commvault's) implementation we recognize the following statuses:

            DEFAULT,
            REINDEXING_ACTIVE,
            REINDEXING_PAUSED,
            PROCESSED,
            ERROR,
            CORRECTVERSION

 

2) At the segment level -> This is where we piggyback on Lucene's design and 
it's beautiful! As we iterate over each segment, we read the live docs out of  
the segment, create a SolrInputDocument out of it and reindex using Solr's API. 
This helps achieve two things:

i) A reindexed doc helps mark an existing (old) doc as deleted (when 
auto-commit kicks in). This way if the service goes down, we don't need to 
process the already processed docs of the service. And if the service goes down 
before a commit could be processed, the small penalty is reprocessing the docs 
of only that segment. 

ii) When a segment is fully processed, Lucene's DeletionPolicy deletes it 
reclaiming space in the process. Hence we never process the same segment again.

Note that as we do this, we are in no way interfering with Lucene's index 
structure directly and only interacting by means of APIs.

 

A combination of these factors helps maintain continuity in the processing of a 
core despite failures, without running the risk of corruption.

 

 

> Automatically upgrade Solr indexes without needing to reindex from source
> -
>
> Key: SOLR-17725
> URL: https://issues.apache.org/jira/browse/SOLR-17725
> Project: Solr
>  Issue Type: Improvement
>Reporter: Rahul Goswami
>Priority: Major
> Attachments: High Level Design.png
>
>
> Today upgrading from Solr version X to X+2 requires complete reingestion of 
> data from source. This comes from Lucene's constraint which only guarantees 
> index compatibility between the version the index was created in and the 
> immediate next version. 
> This reindexing usually comes with added downtime and/or cost. Especially in 
> case of deployments which are in customer environments and not compl

[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source

2025-04-14 Thread Rahul Goswami (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17944396#comment-17944396
 ] 

Rahul Goswami edited comment on SOLR-17725 at 4/14/25 3:43 PM:
---

Will do [~dsmiley] Thanks.

 

[~gus] As far as I can see, the current implementation doesn't run the risk of 
corruption. The status is maintained in two ways:

1) At the core level -> to keep track of which core was being processed when 
the service went down/killed. A file autoupgrade_status.csv is maintained which 
is written each time a core is picked up for processing and a status is set for 
the same. Each time the process resumes it picks up the core with status 
"REINDEXING_ACTIVE" if any. For SolrCloud, this file can be housed in 
Zookeeper. This is an implementation detail I am happy to discuss further, but 
in our (Commvault's) implementation we recognize the following statuses:

            DEFAULT,
            REINDEXING_ACTIVE,
            REINDEXING_PAUSED,
            PROCESSED,
            ERROR,
            CORRECTVERSION

 

2) At the segment level -> This is where we piggyback on Lucene's design and 
it's beautiful! As we iterate over each segment, we read the live docs out of  
the segment, create a SolrInputDocument out of it and reindex using Solr's API. 
This helps achieve two things:

i) A reindexed doc helps mark an existing (old) doc as deleted (when 
auto-commit kicks in). This way if the service goes down, we don't need to 
process the already processed docs of the service. And if the service goes down 
before a commit could be processed, the small penalty is reprocessing the docs 
of only that segment. 

ii) When a segment is fully processed, Lucene's DeletionPolicy deletes it 
reclaiming space in the process. Hence we never process the same segment again.

Note that as we do this, we are in no way interfering with Lucene's index 
structure directly and only interacting by means of APIs.

 

A combination of these factors helps maintain continuity in the processing of a 
core despite failures, without running the risk of corruption.

 

 


was (Author: [email protected]):
Will do [~dsmiley] Thanks.

 

[~gus] As far as I can see, the current implementation doesn't run the risk of 
corruption. The status is maintained in two ways:

1) At the core level -> to keep track of which core was being processed when 
the service went down/killed. A file autoupgrade_status.csv is maintained which 
is written each time a core is picked up for processing and a status is set for 
the same. Each time the process resumes it picks up the core with status 
"REINDEXING_ACTIVE" if any. For SolrCloud, this file can be housed in Zookeeper 
. This is an implementation detail I am happy to discuss further, but in our 
(Commvault's)  implementation we recognize the following statuses

            DEFAULT,
            REINDEXING_ACTIVE,
            REINDEXING_PAUSED,
            PROCESSED,
            ERROR,
            CORRECTVERSION

 

2) At the segment level -> This is where we piggyback on Lucene's design and 
it's beautiful! As we iterate over each segment, we read the live docs out 
of  the segment, create a SolrInputDocument out of it and reindex using Solr's 
API. This helps achieve two things: 

i) A reindexed doc helps mark an existing (old) doc as deleted (when 
auto-commit kicks in). This way if the service goes down, we don't need to 
process the already processed docs of the service. And if the service goes down 
before a commit could be processed, the small penalty is reprocessing the docs 
of only that segment. 

ii) When a segment is fully processed, Lucene's DeletionPolicy deletes it 
reclaiming space in the process. Hence we never process the same segment again.

Note that as we do this, we are in no way interfering with Lucene's index 
structure directly and only interacting by means of APIs.

 

A combination of these factors helps maintain continuity in the processing of a 
core despite failures, without running the risk of corruption.

 

 

> Automatically upgrade Solr indexes without needing to reindex from source
> -
>
> Key: SOLR-17725
> URL: https://issues.apache.org/jira/browse/SOLR-17725
> Project: Solr
>  Issue Type: Improvement
>Reporter: Rahul Goswami
>Priority: Major
> Attachments: High Level Design.png
>
>
> Today upgrading from Solr version X to X+2 requires complete reingestion of 
> data from source. This comes from Lucene's constraint which only guarantees 
> index compatibility between the version the index was created in and the 
> immediate next version. 
> This reindexing usually comes with added downtime and/or cost. Especially in 
> case of deployments which are in customer environments and not 

[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source

2025-04-13 Thread Gus Heck (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17943925#comment-17943925
 ] 

Gus Heck edited comment on SOLR-17725 at 4/13/25 2:17 PM:
--

I asked it in the user list thread, but didn't see an answer (sorry if I missed 
it). As [~ab] also noted, we need to understand what happens if a node fails 
part way through the process. (i.e. someone kills -9 it, or nobody saw the 
email from amazon that the hardware underlying the VPC instance needs to be 
rebooted... etc.) How does the process resume where it left off, or roll back 
to prevent a corrupted index?


was (Author: gus_heck):
I asked it in the user list thread, but didn't see an answer (sorry if I missed 
it). As ab also noted, we need to understand what happens if a node fails part 
way through the process. (i.e. someone kills -9 it, or nobody saw the email 
from amazon that the hardware underlying the VPC instance needs to be 
rebooted... etc.) How does the process resume where it left off, or roll back 
to prevent a corrupted index?

> Automatically upgrade Solr indexes without needing to reindex from source
> -
>
> Key: SOLR-17725
> URL: https://issues.apache.org/jira/browse/SOLR-17725
> Project: Solr
>  Issue Type: Improvement
>Reporter: Rahul Goswami
>Priority: Major
> Attachments: High Level Design.png
>
>
> Today upgrading from Solr version X to X+2 requires complete reingestion of 
> data from source. This comes from Lucene's constraint which only guarantees 
> index compatibility between the version the index was created in and the 
> immediate next version. 
> This reindexing usually comes with added downtime and/or cost. Especially in 
> case of deployments which are in customer environments and not completely in 
> control of the vendor, this proposition of having to completely reindex the 
> data can become a hard sell.
> I, on behalf of my employer, Commvault, have developed a way which achieves 
> this reindexing in-place on the same index. Also, the process automatically 
> keeps "upgrading" the indexes over multiple subsequent Solr upgrades without 
> needing manual intervention. 
> It comes with the following limitations:
> i) All _source_ fields need to be either stored=true or docValues=true. Any 
> copyField destination fields can be stored=false of course, just that the 
> source fields (or more precisely, the source fields you care about 
> preserving) should be either stored or docValues true. 
> ii) The datatype of an existing field in schema.xml shouldn't change upon 
> Solr upgrade. Introducing new fields is fine. 
> For indexes where this limitation is not a problem (it wasn't for us!), the 
> tool can reindex in-place on the same core with zero downtime and 
> legitimately "upgrade" the index. This can remove a lot of operational 
> headaches, especially in environments with hundreds/thousands of very large 
> indexes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source

2025-04-07 Thread Rahul Goswami (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941704#comment-17941704
 ] 

Rahul Goswami edited comment on SOLR-17725 at 4/7/25 8:11 PM:
--

[~janhoy]  Thanks for taking the time to review the JIRA. Please find my 
thoughts on your questions below:

 
1) Do you intend for this to be a new Solr API, if so what is the proposed API? 
or a CLI utility tool to run on a cold index folder?
> The implementation needs to run on a hot index for it to be lossless. 
> Indexing calls happen using Solr APIs so Solr will need to be running. In our 
> custom implementation I have hooked the process into SolrDispatchFilter 
> load() so that the process can start upon server start for least operational 
> overhead. As a generic solution I am thinking we can expose it as an action 
> (/solr/admin/cores?action=UPGRADEINDEXES) with an "async" option for 
> trackability. This way users can hook up the command into their shell/cmd 
> scripts after Solr starts. Open to suggestions here. 
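As a concrete picture of "hooking the command into a post-start script", here 
is a small stdlib-only sketch. The action name UPGRADEINDEXES and the "async" 
parameter are the proposal from the answer above, not a shipped Solr API, so 
treat the URL as an assumption.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;

public class UpgradeIndexesCall {
    // Build the URL for the *proposed* core-admin action; the async request id
    // would let callers poll REQUESTSTATUS for progress, as with other async actions.
    public static URI upgradeIndexesUri(String solrBase, String requestId) {
        return URI.create(solrBase + "/admin/cores?action=UPGRADEINDEXES&async=" + requestId);
    }

    public static void main(String[] args) {
        URI uri = upgradeIndexesUri("http://localhost:8983/solr", "upgrade-1");
        // A post-start script would issue this request (send() omitted here so the
        // sketch has no network dependency):
        HttpRequest req = HttpRequest.newBuilder(uri).GET().build();
        System.out.println(req.uri());
    }
}
```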
 
2) Is one of your design goals to avoid the need for 2-3x disk space during the 
reindex, since you work on segment level and do merges?
> Reducing infrastructure costs is a major design goal here, as is removing 
> the operational overhead of an index upgrade during a Solr upgrade when possible. 
 
3) Requiring a Lucene API change is a potential blocker, I'd not be surprised if 
the Lucene project rejects making the "created-version" property writable, so 
such a discussion with them would come early
> I agree. I am hopeful(!!) this will not be rejected though since they can 
> implement guardrails around changing the "created-version" property for added 
> security. In my implementation I added the change in CommitInfos to check for 
> all the segments in a commit and ensure they are the new version in every 
> aspect before setting the created-version property. This already happens in a 
> synchronized block upon commit, so in my (limited) opinion, it should be 
> safe. The API they give us can do all required internal validations and fail 
> gracefully without any harm to the index. I can get a discussion started with 
> the Lucene folks once we agree on the basics of this implementation. Or do 
> you suggest I do that right away?
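The validation half of this answer can be sketched with Lucene APIs that exist 
today; the guarded setter for "created-version" is the part that does not yet 
exist and is shown only as a placeholder comment. This is an illustrative 
sketch, not the actual CommitInfos change described above.

```java
// Sketch: before any (hypothetical) bump of the index's created-version,
// verify every segment in the latest commit is already on the current version.
// Assumes: Directory directory (the core's index directory).
SegmentInfos infos = SegmentInfos.readLatestCommit(directory);
boolean allCurrent = true;
for (SegmentCommitInfo sci : infos) {
    if (!Version.LATEST.equals(sci.info.getVersion())) {
        allCurrent = false;   // at least one segment was written by an older version
        break;
    }
}
if (allCurrent && infos.getIndexCreatedVersionMajor() < Version.LATEST.major) {
    // ...this is where a guarded Lucene API could safely rewrite the
    // created-version property, failing gracefully if any check above fails...
}
```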
 
4) Obviously a new Solr API needs to play well with SolrCloud as well as other 
features such as shard split / move etc. Have you thought about locking / 
conflicts?
> SolrCloud challenges are not factored into the current implementation. But 
> given the process works at the core level and is agnostic of the mode, I am 
> optimistic we can adapt the solution for SolrCloud through PR discussions.
We might have to block certain operations like splitshard while this process is 
underway on a collection. 
 
5) A reindex-collection API is probably wanted, however it could be acceptable 
to implement a "core-level" API first and later add a "collection-level" API on 
top of it
> Agreed
 
6) Challenge the assumption that "in-place" segment level is the best choice 
for this feature. Re-indexing into a new collection due to major schema changes 
is also a common use case that this will not address
> I would refer back to my answer to your second question in defense of the 
> "in-place" implementation. Segment-level processing gives us the ability to 
> limit pollution of the index due to merges as we reindex, and also 
> restartability. 
Agreed, this is not a substitute for when a field's data type changes. It is 
intended as a substitute for index upgrade when you upgrade Solr, so as to 
overcome the X --> X+1 --> X+2 version upgrade path limitation which exists 
today even when there are no schema changes. Of course, users are free to add 
new fields and should still be able to use this utility.


was (Author: [email protected]):
[~janhoy]  Thanks for taking the time to review the JIRA. Please find my 
thoughts on your questions below:

 
1) Do you intend for this to be a new Solr API, if so what is the proposed API? 
or a CLI utility tool to run on a cold index folder?
> The implementation needs to run on a hot index for it to be lossless. 
> Indexing calls happen using Solr APIs so Solr will need to be running. In our 
> custom implementation I have hooked the process into SolrDispatchFilter 
> load() so that the process can start upon server start for least operational 
> overhead. As a generic solution I am thinking we can expose it as an action 
> (/solr/admin/cores?action=UPGRADEINDEXES) with an "async" option for 
> trackability. This way users can hook up the command into their shell/cmd 
> scripts after Solr starts. Open to suggestions here,  
 
2) Is one of your design goals to avoid the need for 2-3x disk space during the 
reindex, since you work on segment level and do merges?
> Reducing infrastructure costs is a major design goal here. Also removing the 
> operational 

[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source

2025-04-07 Thread Rahul Goswami (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941691#comment-17941691
 ] 

Rahul Goswami edited comment on SOLR-17725 at 4/7/25 7:18 PM:
--

[~ab] For those running SolrCloud AND having enough capacity in terms of 
infrastructure and budget, the REINDEXCOLLECTION command is a good option. I 
see that it reindexes onto a parallel collection. So for clusters with 
hundreds/thousands of large indexes, that cost can be substantial. Also the 
source collection is put in read-only mode while the reindexing happens, so it 
can be a point of contention in environments which are more update-heavy than 
search-heavy (e.g. for us at Commvault). 

By means of this Jira I am attempting to overcome the Lucene limitation which 
forces you to reindex from source, when you really don't HAVE to. At least I 
would like to offer that option to users who are more cost sensitive or 
operationally sensitive (e.g. solutions which package Solr as part of the 
application and are installed/deployed on customer sites; it can be awkward to 
reason with customers as to why a solution upgrade may need 
downtime/additional infra capacity if it involves a Solr upgrade).

The proposed solution reindexes into the same core, can be easily adapted to 
work with both standalone Solr and SolrCloud, and allows both updates and 
searches to be served while doing so. This also helps remove additional 
operational overhead since now users can focus on just the Solr upgrade without 
having to worry about index compatibility.   

 


was (Author: [email protected]):
[~ab] For those running SolrCloud AND having enough capacity in terms of 
infrastructure and budget, the REINDEXCOLLECTION command is a good option. I 
see that it reindexes onto a parallel collection. So for clusters with 
hundreds/thousands of large indexes, that cost can be substantial. Also the 
source collection is put in read-only mode while the reindexing happens. So can 
be a point of contention in case of environments which are more update heavy 
than search heavy (for eg: for us at Commvault). 

By means of this Jira I am attempting to overcome the Lucene limitation which 
forces you to reindex from source, when you really don't HAVE to. At least I 
would like to offer that option to users who are more cost sensitive or 
operationally sensitive (eg: Solutions which package Solr as part of the 
application and are installed/deployed on customer sites. It can be awkward to 
reason with customers as to why a solution upgrade may need a downtime if it 
involves a Solr upgrade).

The proposed solution reindexes into the same core, can be easily adapted to 
work with both standalone Solr and SolrCloud, and allows both updates and 
searches to be served while doing so. This also helps remove additional 
operational overhead since now users can focus on just the Solr upgrade without 
having to worry about index compatibility.   

 

> Automatically upgrade Solr indexes without needing to reindex from source
> -
>
> Key: SOLR-17725
> URL: https://issues.apache.org/jira/browse/SOLR-17725
> Project: Solr
>  Issue Type: Improvement
>Reporter: Rahul Goswami
>Priority: Major
> Attachments: High Level Design.png
>
>
> Today upgrading from Solr version X to X+2 requires complete reingestion of 
> data from source. This comes from Lucene's constraint which only guarantees 
> index compatibility between the version the index was created in and the 
> immediate next version. 
> This reindexing usually comes with added downtime and/or cost. Especially in 
> case of deployments which are in customer environments and not completely in 
> control of the vendor, this proposition of having to completely reindex the 
> data can become a hard sell.
> I, on behalf of my employer, Commvault, have developed a way which achieves 
> this reindexing in-place on the same index. Also, the process automatically 
> keeps "upgrading" the indexes over multiple subsequent Solr upgrades without 
> needing manual intervention. 
> It comes with the following limitations:
> i) All _source_ fields need to be either stored=true or docValues=true. Any 
> copyField destination fields can be stored=false of course, just that the 
> source fields (or more precisely, the source fields you care about 
> preserving) should be either stored or docValues true. 
> ii) The datatype of an existing field in schema.xml shouldn't change upon 
> Solr upgrade. Introducing new fields is fine. 
> For indexes where this limitation is not a problem (it wasn't for us!), the 
> tool can reindex in-place on the same core with zero downtime and 
> legitimately "upgrade" the index. This can remove a lot of operational 
> hea

[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source

2025-04-03 Thread Jira


[ 
https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17940666#comment-17940666
 ] 

Jan Høydahl edited comment on SOLR-17725 at 4/3/25 11:48 AM:
-

Please clarify your intent with this Jira before continuing with any code 
contributions. While I think such a feature would benefit many Solr users, it 
would be sad to spend lots of time on a particular direction / implementation 
before higher-level questions / designs are clarified. As such, you did the 
correct thing starting a mailing list thread and a JIRA.

My initial questions:
 * Do you intend for this to be a new Solr API, if so what is the proposed API? 
or a CLI utility tool to run on a cold index folder?
 * Is one of your design goals to avoid the need for 2-3x disk space during the 
reindex, since you work on segment level and do merges?
 * Requiring a Lucene API change is a potential blocker, I'd not be surprised if 
the Lucene project rejects making the "created-version" property writable, so 
such a discussion with them would come early
 * Obviously a new Solr API needs to play well with SolrCloud as well as other 
features such as shard split / move etc. Have you thought about locking / 
conflicts?
 * A reindex-collection API is probably wanted, however it could be acceptable 
to implement a "core-level" API first and later add a "collection-level" API on 
top of it
 * Challenge the assumption that "in-place" segment level is the best choice 
for this feature. Re-indexing into a new collection due to major schema changes 
is also a common use case that this will not address


was (Author: janhoy):
Please clarify your intent with this Jira before continuing with any code 
contributions. While I think such a feature would benefit many Solr users, it 
would be sad to spend lots of time on a particular direction / implementation 
before higher-level questions / designs are clarified. As such, you did the 
correct thing starting a mailing list thread and a JIRA.

My initial questions:
 * Do you intend for this to be a new Solr API, if so what is the proposed API? 
or a CLI utility tool to run on a cold index folder?
 * Is one of your design goals to avoid the need for 2-3x disk space during the 
reindex, since you work on segment level and do merges
 * Requiring a Lucene API change is a potential blocker, I'd not be surprised if 
the Lucene project rejects making the "created-version" property writable, so 
such a discussion with them would come early
 * Obviously a new Solr API needs to play well with SolrCloud as well as other 
features such as shard split / move etc. It could however be acceptable to 
implement a "core-level" API first and later a "cluster-level" one on top of it
 * Challenge the assumption that "in-place" segment level is the best choice 
for this feature. Re-index into a new collection due to major schema changes is 
also a common use case that this will not address

> Automatically upgrade Solr indexes without needing to reindex from source
> -
>
> Key: SOLR-17725
> URL: https://issues.apache.org/jira/browse/SOLR-17725
> Project: Solr
>  Issue Type: Improvement
>Reporter: Rahul Goswami
>Priority: Major
> Attachments: High Level Design.png
>
>
> Today upgrading from Solr version X to X+2 requires complete reingestion of 
> data from source. This comes from Lucene's constraint which only guarantees 
> index compatibility between the version the index was created in and the 
> immediate next version. 
> This reindexing usually comes with added downtime and/or cost. Especially in 
> case of deployments which are in customer environments and not completely in 
> control of the vendor, this proposition of having to completely reindex the 
> data can become a hard sell.
> I, on behalf of my employer, Commvault, have developed a way which achieves 
> this reindexing in-place on the same index. Also, the process automatically 
> keeps "upgrading" the indexes over multiple subsequent Solr upgrades without 
> needing manual intervention. 
> It comes with the following limitations:
> i) All _source_ fields need to be either stored=true or docValues=true. Any 
> copyField destination fields can be stored=false of course, just that the 
> source fields (or more precisely, the source fields you care about 
> preserving) should be either stored or docValues true. 
> ii) The datatype of an existing field in schema.xml shouldn't change upon 
> Solr upgrade. Introducing new fields is fine. 
> For indexes where this limitation is not a problem (it wasn't for us!), the 
> tool can reindex in-place on the same core with zero downtime and 
> legitimately "upgrade" the index. This can remove a lot of operational 
> headaches, especially in environ