[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18055685#comment-18055685 ] Rahul Goswami edited comment on SOLR-17725 at 1/31/26 9:00 PM:
---
Based on [~dsmiley]'s suggestion, and with [#3903|https://github.com/apache/solr/pull/3903] ripe for merging, I am linking a separate JIRA (SOLR-18096) to the PR and to this JIRA (which provides the merge policy that powers the PR implementation).

was (Author: [email protected]):
Based on [~dsmiley]'s suggestion, and with [#3903|https://github.com/apache/solr/pull/3903] (Expose a /admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade) ripe for merging, I am linking a separate JIRA ([SOLR-18096|https://issues.apache.org/jira/browse/SOLR-18096]) to the PR and to this JIRA (which provides the merge policy that powers the PR implementation).

> Automatically upgrade Solr indexes without needing to reindex from source
> -
>
> Key: SOLR-17725
> URL: https://issues.apache.org/jira/browse/SOLR-17725
> Project: Solr
> Issue Type: Improvement
> Reporter: Rahul Goswami
> Priority: Major
> Labels: pull-request-available
> Fix For: 10.0, 9.11
>
> Attachments: High Level Design.png
>
> Time Spent: 11.5h
> Remaining Estimate: 0h
>
> Today, upgrading from Solr version X to X+2 requires complete reingestion of data from source. This stems from Lucene's constraint of only guaranteeing index compatibility between the version the index was created in and the immediately following major version.
> This reindexing usually comes with added downtime and/or cost. Especially for deployments that live in customer environments and are not completely under the vendor's control, having to completely reindex the data can become a hard sell.
> I, on behalf of my employer, Commvault, have developed a way that achieves this reindexing in-place on the same index. The process also automatically keeps "upgrading" the indexes over multiple subsequent Solr upgrades without needing manual intervention.
> It comes with the following limitations:
> i) All _source_ fields need to be either stored=true or docValues=true. Any copyField destination fields can of course be stored=false; it is just the source fields (or, more precisely, the source fields you care about preserving) that should have stored=true or docValues=true.
> ii) The datatype of an existing field in schema.xml shouldn't change upon Solr upgrade. Introducing new fields is fine.
> For indexes where these limitations are not a problem (they weren't for us!), the tool can reindex in-place on the same core with zero downtime and legitimately "upgrade" the index. This can remove a lot of operational headaches, especially in environments with hundreds or thousands of very large indexes.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
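To illustrate limitation (i) in the issue description, a hypothetical schema.xml fragment — the field names and types here are purely illustrative, not taken from the actual project:

```xml
<!-- Source fields: must be stored=true or docValues=true so their values
     can be read back during the in-place reindex. -->
<field name="title" type="string" indexed="true" stored="true"/>
<field name="price" type="pfloat" indexed="true" stored="false" docValues="true"/>

<!-- copyField destination: may be stored=false, since it is repopulated
     from its source field during reindexing. -->
<field name="text_all" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="text_all"/>
```

Only the source fields (`title`, `price` above) need to be recoverable; anything derived via copyField is rebuilt as a side effect of the reindex.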
[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048422#comment-18048422 ] Rahul Goswami edited comment on SOLR-17725 at 12/30/25 7:02 PM:
---
[~ichattopadhyaya] This is waiting for [#3903|https://github.com/apache/solr/pull/3903] to be merged. This JIRA is split across 2 PRs:
[#3883|https://github.com/apache/solr/pull/3883] => A new LatestVersionMergePolicyFactory to block older segments from participating in merges. (This is merged.)
[#3903|https://github.com/apache/solr/pull/3903] => Expose a /admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade.
However, even with just #3883, users should be able to configure the merge policy in solrconfig and simply reindex the data. That would be enough to enable them to upgrade to Solr 11 in the future without recreating the core.
As discussed in the thread, I was able to get the associated Lucene PRs into main and Lucene 10, so we are good there. I am almost done with testing #3903, pending one integration issue when calling the REST endpoint in async mode (passing the async=request_id param). I expect to open it up for reviews by tonight. But that should not hold up the 10.0 release, since #3883 is in and is by itself sufficient to facilitate the upgrade. The UPGRADECOREINDEX Core Admin API (in #3903) removes some of the manual steps and performs the upgrade in a more optimized way.
Also, I am glad #3883 was merged into branch_9x as well (thanks David!), which essentially means any index originally created in Solr 8.x now has an upgrade path to 9.x and later without having to recreate the index from source.
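As a sketch of the "configure the merge policy in solrconfig" step mentioned above — the `mergePolicyFactory` element is standard Solr index configuration, but the factory's package path below is an assumption; consult the merged PR #3883 for the exact class name:

```xml
<!-- solrconfig.xml: keep segments written by an older Lucene major version
     from participating in merges with newer segments, so a full reindex
     leaves the core entirely on the latest index format.
     The package path is an assumption; see PR #3883 for the real one. -->
<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.LatestVersionMergePolicyFactory"/>
</indexConfig>
```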
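A hedged sketch of invoking the UPGRADECOREINDEX Core Admin API from #3903. Only `action=UPGRADECOREINDEX` and the `async=request_id` parameter are mentioned in this thread; the `core` parameter name and polling via `REQUESTSTATUS` follow the usual CoreAdmin conventions and are assumptions, so check the merged PR for the final contract:

```shell
# Hypothetical UPGRADECOREINDEX calls; `core=` and REQUESTSTATUS polling are
# assumptions based on standard CoreAdmin conventions, not confirmed by the PR.
SOLR_URL="http://localhost:8983/solr"
CORE="my_core"

# Synchronous upgrade (blocks until the in-place reindex completes):
URL="${SOLR_URL}/admin/cores?action=UPGRADECOREINDEX&core=${CORE}"

# Async submission; poll the result afterwards with action=REQUESTSTATUS:
ASYNC_URL="${URL}&async=upgrade-req-1"

# Run against a live Solr node:
echo "curl \"${URL}\""
echo "curl \"${ASYNC_URL}\""
```

The async form matters for large cores, where the upgrade can outlive an HTTP timeout; the request id lets the caller poll for completion instead of holding the connection open.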
[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048422#comment-18048422 ] Rahul Goswami edited comment on SOLR-17725 at 12/30/25 6:55 PM: [~ichattopadhyaya] This is waiting for [#3903 |https://github.com/apache/solr/pull/3903]to be merged. This JIRA is split across 2 PRs: [#3883 |https://github.com/apache/solr/pull/3883] => A new LatestVersionMergePolicyFactory to block older segments from participating in merges. (This is merged) [#3903 |https://github.com/apache/solr/pull/3903] => Expose a /admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade However even with just #3883, users should be able to configure the merge policy in solrconfig and simply reindex the data. That would be enough to enable them to upgrade to Solr 11 in the future without recreating the core. As discussed in the thread, I was able to get the associated Lucene PRs into main and Lucene 10, so we are good there. I am almost done with testing #3903, pending one integration issue while calling the REST endpoint in async mode (passing async=request_id param). I expect to open it up for reviews by tonight. But that should not hold the 10.0 release since #3883 is in, and by itself is sufficient to facilitate the upgrade. The UPGRADECOREINDEX Core Admin API (in #3903) helps remove some of the manual steps and facilitates the upgrade in a more optimized way. Also, I am glad #3883 was also merged into branch_9x. Which essentially means any index originally created in Solr 8.x now has an upgrade path to 9x and later without having to recreate the index from source. was (Author: [email protected]): [~ichattopadhyaya] This is waiting for [#3903 |https://github.com/apache/solr/pull/3903]to be merged. This JIRA is split across 2 PRs: [#3883 |https://github.com/apache/solr/pull/3883] => A new LatestVersionMergePolicyFactory to block older segments from participating in merges. 
(This is merged) [#3903 |https://github.com/apache/solr/pull/3903] => Expose a /admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade However even with just #3883, users should be able to configure the merge policy in solrconfig and simply reindex the data. That would be enough to enable them to upgrade to Solr 11 in the future without recreating the core. As discussed in the thread, I was able to get the associated Lucene PRs into main and Lucene 10, so we are good there. I am almost done with testing #3903, pending one integration issue while calling the REST endpoint in async mode (passing async=request_id param). I expect to open it up for reviews by tonight. But that should not hold the 10x release since #3883 by itself is sufficient to facilitate the upgrade. The UPGRADECOREINDEX Core Admin API (in #3903) helps remove some of the manual steps and facilitates the upgrade in a more optimized way. Also, I am glad #3883 was also merged into branch_9x. Which essentially means any index originally created in Solr 8.x now has an upgrade path to 9x and later without having to recreate the index from source. > Automatically upgrade Solr indexes without needing to reindex from source > - > > Key: SOLR-17725 > URL: https://issues.apache.org/jira/browse/SOLR-17725 > Project: Solr > Issue Type: Improvement >Reporter: Rahul Goswami >Priority: Major > Labels: pull-request-available > Fix For: 10.0, 9.11 > > Attachments: High Level Design.png > > Time Spent: 2.5h > Remaining Estimate: 0h > > Today upgrading from Solr version X to X+2 requires complete reingestion of > data from source. This comes from Lucene's constraint which only guarantees > index compatibility between the version the index was created in and the > immediate next version. > This reindexing usually comes with added downtime and/or cost. 
Especially in > case of deployments which are in customer environments and not completely in > control of the vendor, this proposition of having to completely reindex the > data can become a hard sell. > I, on behalf of my employer, Commvault, have developed a way which achieves > this reindexing in-place on the same index. Also, the process automatically > keeps "upgrading" the indexes over multiple subsequent Solr upgrades without > needing manual intervention. > It comes with the following limitations: > i) All _source_ fields need to be either stored=true or docValues=true. Any > copyField destination fields can be stored=false of course, just that the > source fields (or more precisely, the source fields you care about > preserving) should be either stored or docValues true. > ii) The datatype of an existing field in sche
[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048422#comment-18048422 ] Rahul Goswami edited comment on SOLR-17725 at 12/30/25 6:51 PM: [~ichattopadhyaya] This is waiting for [#3903 |https://github.com/apache/solr/pull/3903]to be merged. This JIRA is split across 2 PRs: [#3883 |https://github.com/apache/solr/pull/3883] => A new LatestVersionMergePolicyFactory to block older segments from participating in merges. (This is merged) [#3903 |https://github.com/apache/solr/pull/3903] => Expose a /admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade However even with just #3883, users should be able to configure the merge policy in solrconfig and simply reindex the data. That would be enough to enable them to upgrade to Solr 11 in the future without recreating the core. As discussed in the thread, I was able to get the associated Lucene PRs into main and Lucene 10, so we are good there. I am almost done with testing #3903, pending one integration issue while calling the REST endpoint in async mode (passing async=request_id param). I expect to open it up for reviews by tonight. But that should not hold the 10x release since #3883 by itself is sufficient to facilitate the upgrade. The UPGRADECOREINDEX Core Admin API (in #3903) helps remove some of the manual steps and facilitates the upgrade in a more optimized way. Also, I am glad #3883 was also merged into branch_9x. Which essentially means any index originally created in Solr 8.x now has an upgrade path to 9x and later without having to recreate the index from source. was (Author: [email protected]): [~ichattopadhyaya] This is waiting for [#3903 |https://github.com/apache/solr/pull/3903]to be merged. This JIRA is split across 2 PRs: [#3883 |https://github.com/apache/solr/pull/3883] => A new LatestVersionMergePolicyFactory to block older segments from participating in merges. 
(This is merged) [#3903 |https://github.com/apache/solr/pull/3903] => Expose a /admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade However even with just #3883, users should be able to configure the merge policy in solrconfig and simply reindex the data. That would be enough to enable them to upgrade to Solr 11 in the future without recreating the core. As discussed in the thread, I was able to get the associated Lucene PRs into main and Lucene 10, so we are good there. I am almost done with testing #3903, pending one integration issue while calling the REST endpoint in async mode (passing async=request_id param). I expect to open it up for reviews by tonight. But that should not hold the 10x release since #3883 by itself is sufficient to facilitate the upgrade. The UPGRADECOREINDEX Core Admin API (in #3903) helps remove some of the manual steps and facilitates the upgrade in a more optimized way. Also, I am glad #3883 was also merged into branch_9x. Which essentially means, any index originally created in Solr 8.x now has an upgrade path to 9x and later without having to recreate the index from source. > Automatically upgrade Solr indexes without needing to reindex from source > - > > Key: SOLR-17725 > URL: https://issues.apache.org/jira/browse/SOLR-17725 > Project: Solr > Issue Type: Improvement >Reporter: Rahul Goswami >Priority: Major > Labels: pull-request-available > Fix For: 10.0, 9.11 > > Attachments: High Level Design.png > > Time Spent: 2.5h > Remaining Estimate: 0h > > Today upgrading from Solr version X to X+2 requires complete reingestion of > data from source. This comes from Lucene's constraint which only guarantees > index compatibility between the version the index was created in and the > immediate next version. > This reindexing usually comes with added downtime and/or cost. 
Especially in > case of deployments which are in customer environments and not completely in > control of the vendor, this proposition of having to completely reindex the > data can become a hard sell. > I, on behalf of my employer, Commvault, have developed a way which achieves > this reindexing in-place on the same index. Also, the process automatically > keeps "upgrading" the indexes over multiple subsequent Solr upgrades without > needing manual intervention. > It comes with the following limitations: > i) All _source_ fields need to be either stored=true or docValues=true. Any > copyField destination fields can be stored=false of course, just that the > source fields (or more precisely, the source fields you care about > preserving) should be either stored or docValues true. > ii) The datatype of an existing field in schema.xml shou
[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048422#comment-18048422 ] Rahul Goswami edited comment on SOLR-17725 at 12/30/25 6:51 PM: [~ichattopadhyaya] This is waiting for [#3903 |https://github.com/apache/solr/pull/3903]to be merged. This JIRA is split across 2 PRs: [#3883 |https://github.com/apache/solr/pull/3883] => A new LatestVersionMergePolicyFactory to block older segments from participating in merges. (This is merged) [#3903 |https://github.com/apache/solr/pull/3903] => Expose a /admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade However even with just #3883, users should be able to configure the merge policy in solrconfig and simply reindex the data. That would be enough to enable them to upgrade to Solr 11 in the future without recreating the core. As discussed in the thread, I was able to get the associated Lucene PRs into main and Lucene 10, so we are good there. I am almost done with testing #3903, pending one integration issue while calling the REST endpoint in async mode (passing async=request_id param). I expect to open it up for reviews by tonight. But that should not hold the 10x release since #3883 by itself is sufficient to facilitate the upgrade. The UPGRADECOREINDEX Core Admin API (in #3903) helps remove some of the manual steps and facilitates the upgrade in a more optimized way. Also, I am glad #3883 was also merged into branch_9x. Which essentially means, any index originally created in Solr 8.x now has an upgrade path to 9x and later without having to recreate the index from source. was (Author: [email protected]): [~ichattopadhyaya] This is waiting for [#3903 |https://github.com/apache/solr/pull/3903]to be merged. This JIRA is split across 2 PRs: [#3883 |https://github.com/apache/solr/pull/3883] => A new LatestVersionMergePolicyFactory to block older segments from participating in merges. 
(This is merged) [#3903 |https://github.com/apache/solr/pull/3903] => Expose a /admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade However even with just #3883, users should be able to configure the merge policy in solrconfig and simply reindex the data. That would be enough to enable them to upgrade to Solr 11 in the future without recreating the core. As discussed in the thread, I was able to get the associated Lucene PRs into main and Lucene 10, so we are good there. I am almost done with testing #3903, pending one integration issue while calling the REST endpoint in async mode (passing async=request_id param). I expect to open it up for reviews by tonight. But that should not hold the 10x release since #3883 by itself is sufficient to facilitate the upgrade. The UPGRADECOREINDEX Core Admin API (in #3903) helps remove some of the manual steps and facilitates the upgrade in a more optimized way. I am glad, #3883 was also merged into branch_9x. Which essentially means, any index originally created in Solr 8.x now has an upgrade path to 9x and later without having to recreate the index from source. > Automatically upgrade Solr indexes without needing to reindex from source > - > > Key: SOLR-17725 > URL: https://issues.apache.org/jira/browse/SOLR-17725 > Project: Solr > Issue Type: Improvement >Reporter: Rahul Goswami >Priority: Major > Labels: pull-request-available > Fix For: 10.0, 9.11 > > Attachments: High Level Design.png > > Time Spent: 2.5h > Remaining Estimate: 0h > > Today upgrading from Solr version X to X+2 requires complete reingestion of > data from source. This comes from Lucene's constraint which only guarantees > index compatibility between the version the index was created in and the > immediate next version. > This reindexing usually comes with added downtime and/or cost. 
Especially in > case of deployments which are in customer environments and not completely in > control of the vendor, this proposition of having to completely reindex the > data can become a hard sell. > I, on behalf of my employer, Commvault, have developed a way which achieves > this reindexing in-place on the same index. Also, the process automatically > keeps "upgrading" the indexes over multiple subsequent Solr upgrades without > needing manual intervention. > It comes with the following limitations: > i) All _source_ fields need to be either stored=true or docValues=true. Any > copyField destination fields can be stored=false of course, just that the > source fields (or more precisely, the source fields you care about > preserving) should be either stored or docValues true. > ii) The datatype of an existing field in schema.xml shouldn'
[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048422#comment-18048422 ] Rahul Goswami edited comment on SOLR-17725 at 12/30/25 6:43 PM: [~ichattopadhyaya] This is waiting for [#3903 |https://github.com/apache/solr/pull/3903]to be merged. This JIRA is split across 2 PRs: [#3883 |https://github.com/apache/solr/pull/3883] => A new LatestVersionMergePolicyFactory to block older segments from participating in merges. (This is merged) [#3903 |https://github.com/apache/solr/pull/3903] => Expose a /admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade However even with just #3883, users should be able to configure the merge policy in solrconfig and simply reindex the data. That would be enough to enable them to upgrade to Solr 11 in the future without recreating the core. As discussed in the thread, I was able to get the associated Lucene PRs into main and Lucene 10, so we are good there. I am almost done with testing #3903, pending one integration issue while calling the REST endpoint in async mode (passing async=request_id param). I expect to open it up for reviews by tonight. But that should not hold the 10x release since #3883 by itself is sufficient to facilitate the upgrade. The UPGRADECOREINDEX Core Admin API (in #3903) helps remove some of the manual steps and facilitates the upgrade in a more optimized way. I am glad, #3883 was also merged into branch_9x. Which essentially means, any index originally created in Solr 8.x now has an upgrade path to 9x and later without having to recreate the index from source. was (Author: [email protected]): [~ichattopadhyaya] This is waiting for [#3903 |https://github.com/apache/solr/pull/3903]to be merged. This JIRA is split across 2 PRs: [#3883 |https://github.com/apache/solr/pull/3883] => A new LatestVersionMergePolicyFactory to block older segments from participating in merges. 
(This is merged) [#3903 |https://github.com/apache/solr/pull/3903] => Expose a /admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade However even with just #3883, users should be able to configure the merge policy in solrconfig and simply reindex the data. That would be enough to enable them to upgrade to Solr 11 in the future without recreating the core. As discussed in the thread, I was able to get the associated Lucene PRs into main and Lucene 10, so we are good there. I am almost done with testing #3903, pending one integration issue while calling the REST endpoint in async mode (passing async=request_id param). I expect to open it up for reviews by tonight. But that should not hold the 10x release since #3883 still provides a pathway to upgrade, although with a few more manual steps and in a slightly less optimized way to what the UPGRADECOREINDEX Core Admin API does (in #3903). I am glad, #3883 was also merged into branch_9x. Which essentially means, any index originally created in Solr 8.x now has an upgrade path to 9x and later without having to recreate the index from source. > Automatically upgrade Solr indexes without needing to reindex from source > - > > Key: SOLR-17725 > URL: https://issues.apache.org/jira/browse/SOLR-17725 > Project: Solr > Issue Type: Improvement >Reporter: Rahul Goswami >Priority: Major > Labels: pull-request-available > Fix For: 10.0, 9.11 > > Attachments: High Level Design.png > > Time Spent: 2.5h > Remaining Estimate: 0h > > Today upgrading from Solr version X to X+2 requires complete reingestion of > data from source. This comes from Lucene's constraint which only guarantees > index compatibility between the version the index was created in and the > immediate next version. > This reindexing usually comes with added downtime and/or cost. 
Especially in > case of deployments which are in customer environments and not completely in > control of the vendor, this proposition of having to completely reindex the > data can become a hard sell. > I, on behalf of my employer, Commvault, have developed a way which achieves > this reindexing in-place on the same index. Also, the process automatically > keeps "upgrading" the indexes over multiple subsequent Solr upgrades without > needing manual intervention. > It comes with the following limitations: > i) All _source_ fields need to be either stored=true or docValues=true. Any > copyField destination fields can be stored=false of course, just that the > source fields (or more precisely, the source fields you care about > preserving) should be either stored or docValues true. > ii) The datatype of an existing field in schema.xml shouldn't change upon > Solr
[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048422#comment-18048422 ] Rahul Goswami edited comment on SOLR-17725 at 12/30/25 6:40 PM: [~ichattopadhyaya] This is waiting for [#3903 |https://github.com/apache/solr/pull/3903]to be merged. This JIRA is split across 2 PRs: [#3883 |https://github.com/apache/solr/pull/3883] => A new LatestVersionMergePolicyFactory to block older segments from participating in merges. (This is merged) [#3903 |https://github.com/apache/solr/pull/3903] => Expose a /admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade However even with just #3883, users should be able to configure the merge policy in solrconfig and simply reindex the data. That would be enough to enable them to upgrade to Solr 11 in the future without recreating the core. As discussed in the thread, I was able to get the associated Lucene PRs into main and Lucene 10, so we are good there. I am almost done with testing #3903, pending one integration issue while calling the REST endpoint in async mode (passing async=request_id param). I expect to open it up for reviews by tonight. But that should not hold the 10x release since #3883 still provides a pathway to upgrade, although with a few more manual steps and in a slightly less optimized way to what the UPGRADECOREINDEX Core Admin API does (in #3903). I am glad, #3883 was also merged into branch_9x. Which essentially means, any index originally created in Solr 8.x now has an upgrade path to 9x and later without having to recreate the index from source. was (Author: [email protected]): [~ichattopadhyaya] This is waiting for [#3903 |https://github.com/apache/solr/pull/3903]to be merged. This JIRA is split across 2 PRs: [#3883 |https://github.com/apache/solr/pull/3883] => A new LatestVersionMergePolicyFactory to block older segments from participating in merges. 
(This is merged) [#3903 |https://github.com/apache/solr/pull/3903] => Expose a /admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade However even with just #3883, users should be able to configure the merge policy in solrconfig and simply reindex the data. That would be enough to enable them to upgrade to Solr 11 in the future without recreating the core. As discussed in the thread, I was able to get the associated Lucene PRs into main and Lucene 10, so we are good there. I am almost done with testing #3903, pending one integration issue while calling the REST endpoint in async mode (passing async=request_id param). I expect to open it up for reviews by tonight. But that should not hold the 10x release since #3883 still provides a pathway to upgrade, with a few more manual steps and in a slightly less optimized way to what the UPGRADECOREINDEX Core Admin API does (in #3903). I am glad, #3883 was also merged into branch_9x. Which essentially means, any index originally created in Solr 8.x now has an upgrade path to 9x and later without having to recreate the index from source. > Automatically upgrade Solr indexes without needing to reindex from source > - > > Key: SOLR-17725 > URL: https://issues.apache.org/jira/browse/SOLR-17725 > Project: Solr > Issue Type: Improvement >Reporter: Rahul Goswami >Priority: Major > Labels: pull-request-available > Fix For: 10.0, 9.11 > > Attachments: High Level Design.png > > Time Spent: 2.5h > Remaining Estimate: 0h > > Today upgrading from Solr version X to X+2 requires complete reingestion of > data from source. This comes from Lucene's constraint which only guarantees > index compatibility between the version the index was created in and the > immediate next version. > This reindexing usually comes with added downtime and/or cost. 
Especially in > case of deployments which are in customer environments and not completely in > control of the vendor, this proposition of having to completely reindex the > data can become a hard sell. > I, on behalf of my employer, Commvault, have developed a way which achieves > this reindexing in-place on the same index. Also, the process automatically > keeps "upgrading" the indexes over multiple subsequent Solr upgrades without > needing manual intervention. > It comes with the following limitations: > i) All _source_ fields need to be either stored=true or docValues=true. Any > copyField destination fields can be stored=false of course, just that the > source fields (or more precisely, the source fields you care about > preserving) should be either stored or docValues true. > ii) The datatype of an existing field in schema.xml shouldn't change upon > Solr upgrade. Introducing new
[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048422#comment-18048422 ] Rahul Goswami edited comment on SOLR-17725 at 12/30/25 6:39 PM:

[~ichattopadhyaya] This is waiting for [#3903|https://github.com/apache/solr/pull/3903] to be merged. This JIRA is split across two PRs:

[#3883|https://github.com/apache/solr/pull/3883] => A new LatestVersionMergePolicyFactory to block older segments from participating in merges. (This is merged)
[#3903|https://github.com/apache/solr/pull/3903] => Expose a /admin/cores?action=UPGRADECOREINDEX endpoint to handle the in-place upgrade

However, even with just #3883, users should be able to configure the merge policy in solrconfig.xml and simply reindex the data. That would be enough to let them upgrade to Solr 11 in the future without recreating the core.

As discussed in the thread, I was able to get the associated Lucene PRs into main and Lucene 10, so we are good there. I am almost done with testing #3903, pending one integration issue when calling the REST endpoint in async mode (passing the async=request_id param). I expect to open it up for reviews by tonight. That should not hold up the 10x release, though, since #3883 still provides a pathway to upgrade, with a few more manual steps and in a slightly less optimized way than what the UPGRADECOREINDEX Core Admin API does (in #3903).

I am glad #3883 was also merged into branch_9x, which essentially means any index originally created in Solr 8.x now has an upgrade path to 9x and later without having to recreate the index from source.

> Automatically upgrade Solr indexes without needing to reindex from source
> -
>
> Key: SOLR-17725
> URL: https://issues.apache.org/jira/browse/SOLR-17725
> Project: Solr
> Issue Type: Improvement
> Reporter: Rahul Goswami
> Priority: Major
> Labels: pull-request-available
> Fix For: 10.0, 9.11
>
> Attachments: High Level Design.png
>
> Time Spent: 2.5h
> Remaining Estimate: 0h
>
> Today upgrading from Solr version X to X+2 requires complete reingestion of data from source. This comes from Lucene's constraint, which only guarantees index compatibility between the version the index was created in and the immediate next version.
> This reindexing usually comes with added downtime and/or cost. Especially in the case of deployments which are in customer environments and not completely in the control of the vendor, the proposition of having to completely reindex the data can become a hard sell.
> I, on behalf of my employer, Commvault, have developed a way which achieves this reindexing in-place on the same index. Also, the process automatically keeps "upgrading" the indexes over multiple subsequent Solr upgrades without needing manual intervention.
> It comes with the following limitations:
> i) All _source_ fields need to be either stored=true or docValues=true. Any copyField destination fields can be stored=false of course; just the source fields (or more precisely, the source fields you care about preserving) should have either stored or docValues set to true.
> ii) The datatype of an existing field in schema.xml shouldn't change upon Solr upgrade. Introducing new fields is fine.
> For indexes where this limitation is not a problem (it wasn't for us!), the tool can reindex in-place on the same core with zero downtime and legitimately "upgrade" the index. This can remove a lot of operational headaches, especially in environments with hundreds/thousands of very large indexes.
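For reference, wiring the #3883 merge policy into solrconfig.xml would presumably look something like the sketch below. The factory name LatestVersionMergePolicyFactory comes from the PR, but the exact package and any supported configuration attributes are assumptions until the feature is documented:

```xml
<!-- Hypothetical sketch: configure the merge policy from #3883 so that
     segments written by an older Solr/Lucene version are blocked from
     participating in merges while the data is reindexed in place.
     The <mergePolicyFactory> element inside <indexConfig> is standard
     solrconfig.xml; the class package shown here is an assumption. -->
<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.LatestVersionMergePolicyFactory"/>
</indexConfig>
```

With something like this in place, reindexing into the same core keeps all live segments on the latest version, which is what enables the future no-reindex upgrade; once #3903 lands, the same outcome could presumably be triggered directly with a call along the lines of `/admin/cores?action=UPGRADECOREINDEX&core=<coreName>&async=<request_id>` (parameter names may change before merge).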
[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18039569#comment-18039569 ] Rahul Goswami edited comment on SOLR-17725 at 11/20/25 11:38 PM:

https://github.com/apache/lucene/pull/14607 is merged, so now we should be able to open an index irrespective of which version it was created in, as long as all segments are on either the LATEST or LATEST-1 version. This is the part that will be achieved through this JIRA. Submitted a PR for a custom merge policy to kick off code contribution on this effort. Thanks [~dsmiley] and [~magibney] for the pointers!

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941704#comment-17941704 ] Rahul Goswami edited comment on SOLR-17725 at 9/12/25 5:07 AM:

[~janhoy] Thanks for taking the time to review the JIRA. Please find my thoughts on your questions below:

1) Do you intend for this to be a new Solr API, if so what is the proposed API? Or a CLI utility tool to run on a cold index folder?
> The implementation needs to run on a hot index for it to be lossless. Indexing calls happen using Solr APIs, so Solr will need to be running. In our custom implementation I have hooked the process into SolrDispatchFilter load() so that the process can start upon server start for the least operational overhead. As a generic solution I am thinking we can expose it as an action (/solr/admin/cores?action=UPGRADEINDEXES) with an "async" option for trackability. This way users can hook the command into their shell/cmd scripts after Solr starts. Open to suggestions here.

2) Is one of your design goals to avoid the need for 2-3x disk space during the reindex, since you work on segment level and do merges?
> Reducing infrastructure costs is a major design goal here, as is removing the operational overhead of index upgrade during a Solr upgrade when possible. The fact that the design avoids the need for 2x disk space is a definite major advantage.

3) Requiring a Lucene API change is a potential blocker; I'd not be surprised if the Lucene project rejects making the "created-version" property writable, so such a discussion with them would come early
> I agree. I am hopeful(!!) this will not be rejected though, since they can implement guardrails around changing the "created-version" property for added security. In my implementation I added the change in Lucene IndexWriter to check for all the segments in a commit and ensure they are the new version in every aspect before setting the created-version property. This already happens in a synchronized block upon commit, so in my (limited) opinion, it should be safe. The API they give us can do all required internal validations and fail gracefully without any harm to the index. I can get a discussion started with the Lucene folks once we agree on the basics of this implementation. Or do you suggest I do that right away?

4) Obviously a new Solr API needs to play well with SolrCloud as well as other features such as shard split / move etc. Have you thought about locking / conflicts?
> SolrCloud challenges are not factored into the current implementation. But given that the process works at the core level and is agnostic of the mode, I am optimistic we can adapt the solution for SolrCloud through PR discussions. We might have to block certain operations like splitshard while this process is underway on a collection.

5) A reindex-collection API is probably wanted, however it could be acceptable to implement a "core-level" API first and later add a "collection-level" API on top of it
> Agreed

6) Challenge the assumption that "in-place" segment level is the best choice for this feature. Re-indexing into a new collection due to major schema changes is also a common use case that this will not address
> I would refer to my answer to your second question in defense of the "in-place" implementation. Segment-level processing gives us the ability to restrict pollution of the index due to merges as we reindex, and also restartability. Agreed, this is not a substitute for when a field data type changes. This is intended to be a substitute for index upgrade when you upgrade Solr, so as to overcome the X --> X+1 --> X+2 version upgrade path limitation which exists today despite no schema changes. Of course, users are free to add new fields and should still be able to use this utility.
[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941704#comment-17941704 ] Rahul Goswami edited comment on SOLR-17725 at 9/12/25 5:10 AM: --- [~janhoy] Thanks for taking the time to review the JIRA. Please find my thoughts on your questions below: 1) Do you intend for this to be a new Solr API, if so what is the proposed API? or a CLI utility tool to run on a cold index folder? > The implementation needs to run on a hot index for it to be lossless. > Indexing calls happen using Solr APIs so Solr will need to be running. In our > custom implementation I have hooked the process into SolrDispatchFilter > load() so that the process can start upon server start for least operational > overhead. As a generic solution I am thinking we can expose it as an action > (/solr/admin/cores?action=UPGRADEINDEXES) with an "async" option for > trackability. This way users can hook up the command into their shell/cmd > scripts after Solr starts. Open to suggestions here, 2) Is one of your design goals to avoid the need for 2-3x disk space during the reindex, since you work on segment level and do merges? > Reducing infrastructure costs is a major design goal here. Also removing the > operational overhead of index uprgade during Solr uprgade when possible. The > fact that the design avoids the need for 2x disk space is a definite major > advantage. 3) Requring Lucene API change is a potential blocker, I'd not be surprised if the Lucene project rejects making the "created-version" property writable, so such a discussion with them would come early > I agree. I am hopeful(!!) this will not be rejected though since they can > implement guardrails around changing the "created-version" property for added > security. 
In my implementation I added the change in Lucene IndexWriter to > check for all the segments in a commit and ensure they are the new version in > every aspect before setting the created-version property. This already > happens in a synchronized block upon commit, so in my opinion, it should be > safe. The API they give us can do all required internal validations and fail > gracefully without any harm to the index. I can get a discussion started on > the Lucene dev list once we agree on the basics of this implementation. Or do > you suggest I do that right away? 4) Obviously a new Solr API needs to play well with SolrCloud as well as other features such such as shard split / move etc. Have you thought about locking / conflicts? > SolrCloud challenges are not factored into the current implementation. But > given the process works at Core level and agnostic of the mode, I am > optimistic we can adapt the solution for SolrCloud through PR discussions. We might have to block certain operations like splitshard while this process is underway on a collection. 5) A reindex-collection API is probably wanted, however it could be acceptable to implement a "core-level" API first and later add a "collection-level" API on top of it > Agreed 6) Challenge the assumption that "in-place" segment level is the best choice for this feature. Re-indexing into a new collection due to major schema changes is also a common use case that this will not address > I would revert to my answer to your second question in defense of the > "in-place" implementation. Segment level processing gives us the ability to > restrict pollution of index due to merges as we reindex and also > restartability. Agreed this is not a substitute for when a field data type changes. This is intended to be a substitute for index upgrade when you upgrade Solr so as to overcome the X --> X+1 --> X+2 version upgrade path limitation which exists today despite no schema changes. 
Of course, users are free to add new fields and should still be able to use this utility.
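To make the proposed invocation concrete, here is a minimal client-side sketch of the suggested core-admin call. The UPGRADEINDEXES action name and the "core"/"async" parameters follow the proposal in this thread; none of them exists in Solr today, so everything here is an assumption for illustration:

```java
import java.net.URI;

// Sketch of how a post-startup script might target the proposed (hypothetical)
// UPGRADEINDEXES core-admin action. Only builds the request URI; the action
// itself does not exist in Solr yet.
public class UpgradeIndexesCall {
    // Assembles the CoreAdmin-style URL for the proposed action.
    static URI buildUpgradeUri(String solrBaseUrl, String coreName, String asyncId) {
        return URI.create(solrBaseUrl + "/admin/cores?action=UPGRADEINDEXES"
                + "&core=" + coreName
                + "&async=" + asyncId);
    }

    public static void main(String[] args) {
        // A shell/cmd script run after Solr starts could issue this with curl or wget.
        System.out.println(buildUpgradeUri("http://localhost:8983/solr", "core1", "upgrade-job-1"));
    }
}
```

With an "async" id, progress would presumably be polled through the existing CoreAdmin REQUESTSTATUS mechanism, giving the trackability mentioned above.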
[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17944396#comment-17944396 ] Rahul Goswami edited comment on SOLR-17725 at 4/25/25 5:36 AM: --- Will do [~dsmiley], thanks. [~gus] As far as I can see, the current implementation doesn't run the risk of corruption. The status is maintained in two ways: 1) At the core level -> to keep track of which core was being processed when the service went down or was killed. A file autoupgrade_status.csv is maintained, which is written each time a core is picked up for processing and a status is set for it. Each time the process resumes, it picks up the core with status "REINDEXING_ACTIVE", if any. For SolrCloud, this file can be housed in ZooKeeper. This is an implementation detail I am happy to discuss further, but in our (Commvault's) implementation we recognize the following statuses: DEFAULT, REINDEXING_ACTIVE, REINDEXING_PAUSED, PROCESSED, ERROR, CORRECTVERSION 2) At the segment level -> This is where we piggyback on Lucene's design, and it's beautiful! As we iterate over each segment, we read the live docs out of the segment, create a SolrInputDocument out of each, and reindex using Solr's API. This helps achieve two things: i) A reindexed doc marks an existing (old) doc as deleted (when auto-commit kicks in). This way, if the service goes down, we don't need to reprocess the already processed docs of the segment. And if the service goes down before a commit could be processed, the small penalty is reprocessing the docs of only that segment. ii) When a segment is fully processed, Lucene's DeletionPolicy deletes it, reclaiming space in the process. Hence we never process the same segment again. Note that as we do this, we are in no way interfering with Lucene's index structure directly, and are only interacting by means of APIs.
A combination of these factors helps maintain continuity in the processing of a core despite failures, without running the risk of corruption.
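The core-level bookkeeping described above can be sketched as follows. Only the status names come from the comment; the two-column CSV layout (coreName,status) and the method names are illustrative assumptions, not the actual Commvault implementation:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Sketch of the autoupgrade_status.csv bookkeeping: one line per core, and on
// restart the process resumes whichever core was mid-reindex, if any.
public class AutoUpgradeStatus {
    enum Status { DEFAULT, REINDEXING_ACTIVE, REINDEXING_PAUSED, PROCESSED, ERROR, CORRECTVERSION }

    // Parses the assumed "coreName,status" lines into a core -> status map.
    static Map<String, Status> parse(List<String> csvLines) {
        Map<String, Status> statuses = new LinkedHashMap<>();
        for (String line : csvLines) {
            String[] parts = line.split(",");
            statuses.put(parts[0].trim(), Status.valueOf(parts[1].trim()));
        }
        return statuses;
    }

    // On restart, pick up the core that was being processed when the service died.
    static Optional<String> coreToResume(Map<String, Status> statuses) {
        return statuses.entrySet().stream()
                .filter(e -> e.getValue() == Status.REINDEXING_ACTIVE)
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        Map<String, Status> statuses = parse(List.of(
                "core1,PROCESSED", "core2,REINDEXING_ACTIVE", "core3,DEFAULT"));
        System.out.println(coreToResume(statuses).orElse("none"));
    }
}
```

In SolrCloud mode the same map could be read from a ZooKeeper node instead of a local file, as suggested above.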
[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17940243#comment-17940243 ] Rahul Goswami edited comment on SOLR-17725 at 4/19/25 12:00 AM: Attached document outlines an example where the upgrade tool works on an index originally created in Solr 7.x, AFTER an upgrade to Solr 8.x. Key points: 1) Lucene version X can read an index created in version X-1. Writing of new segments happens with the latest version codec. 2) When a segment merge happens, the merged segment maintains a version stamp, "minVersion", which is the lowest version among the segments participating in the merge. 3) The segments_* file in a Lucene index maintains the Lucene version where the index was first created. The design doc outlines the process of converting all segments to the new version. It's sort of a pull model where you first upgrade and then "pull" the index to the current version. By the end of the process outlined in the doc, all segments get converted to the new version and the index in all respects is an "upgraded" index. The only missing piece is to update the index creation version in the commit point. I did this by exposing a method in Lucene's IndexWriter which validates the version of all segments and updates the creation version stamp in the commit point (we might need to request an API from Lucene here). When Solr 9.x opens this index, it can read it (thanks to point #1), and the same process repeats to make the index ready for Solr 10.x.
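The guardrail discussed here, i.e. re-stamping the index creation version only after every segment has been verified to be at the new version, can be modeled with plain version strings. This is a sketch of the validation logic only, assuming simple version strings; it is not the actual Lucene IndexWriter change, which would operate on real SegmentInfos inside the commit lock:

```java
import java.util.List;

// Models the proposed safety check: the "created-version" stamp may be updated
// only when every segment's (min)version already matches the target version,
// i.e. the index is upgraded "in every aspect".
public class CreatedVersionGuard {
    // Returns true only if the commit contains segments and all of them are
    // already at the target version, so re-stamping cannot hide an old segment.
    static boolean canStampCreatedVersion(List<String> segmentMinVersions, String targetVersion) {
        return !segmentMinVersions.isEmpty()
                && segmentMinVersions.stream().allMatch(targetVersion::equals);
    }

    public static void main(String[] args) {
        // All segments rewritten: safe to update the creation version stamp.
        System.out.println(canStampCreatedVersion(List.of("9.7.0", "9.7.0"), "9.7.0"));
        // An old segment remains: the stamp must not be updated.
        System.out.println(canStampCreatedVersion(List.of("8.11.2", "9.7.0"), "9.7.0"));
    }
}
```

A check of this shape is what would let a Lucene-side API "fail gracefully without any harm to the index", as argued earlier in the thread.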
[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17943925#comment-17943925 ] Gus Heck edited comment on SOLR-17725 at 4/13/25 2:17 PM: -- I asked it in the user list thread, but didn't see an answer (sorry if I missed it). As [~ab] also noted, we need to understand what happens if a node fails part way through the process. (i.e. someone kills -9 it, or nobody saw the email from amazon that the hardware underlying the VPC instance needs to be rebooted... etc.) How does the process resume where it left off, or roll back to prevent a corrupted index? > Automatically upgrade Solr indexes without needing to reindex from source > - > > Key: SOLR-17725 > URL: https://issues.apache.org/jira/browse/SOLR-17725 > Project: Solr > Issue Type: Improvement >Reporter: Rahul Goswami >Priority: Major > Attachments: High Level Design.png > > > Today upgrading from Solr version X to X+2 requires complete reingestion of > data from source. This comes from Lucene's constraint which only guarantees > index compatibility between the version the index was created in and the > immediate next version. > This reindexing usually comes with added downtime and/or cost. Especially in > case of deployments which are in customer environments and not completely in > control of the vendor, this proposition of having to completely reindex the > data can become a hard sell.
> I, on behalf of my employer, Commvault, have developed a way which achieves > this reindexing in-place on the same index. Also, the process automatically > keeps "upgrading" the indexes over multiple subsequent Solr upgrades without > needing manual intervention. > It comes with the following limitations: > i) All _source_ fields need to be either stored=true or docValues=true. Any > copyField destination fields can be stored=false of course, just that the > source fields (or more precisely, the source fields you care about > preserving) should be either stored or docValues true. > ii) The datatype of an existing field in schema.xml shouldn't change upon > Solr upgrade. Introducing new fields is fine. > For indexes where this limitation is not a problem (it wasn't for us!), the > tool can reindex in-place on the same core with zero downtime and > legitimately "upgrade" the index. This can remove a lot of operational > headaches, especially in environments with hundreds/thousands of very large > indexes. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941704#comment-17941704 ] Rahul Goswami edited comment on SOLR-17725 at 4/7/25 8:11 PM:
--
[~janhoy] Thanks for taking the time to review the JIRA. Please find my thoughts on your questions below:

1) Do you intend for this to be a new Solr API, if so what is the proposed API? Or a CLI utility tool to run on a cold index folder?
> The implementation needs to run on a hot index for it to be lossless. Indexing calls happen using Solr APIs, so Solr will need to be running. In our custom implementation I have hooked the process into SolrDispatchFilter load() so that the process can start upon server start for the least operational overhead. As a generic solution I am thinking we can expose it as an action (/solr/admin/cores?action=UPGRADEINDEXES) with an "async" option for trackability. This way users can hook the command into their shell/cmd scripts after Solr starts. Open to suggestions here.

2) Is one of your design goals to avoid the need for 2-3x disk space during the reindex, since you work on segment level and do merges?
> Reducing infrastructure costs is a major design goal here. Also removing the operational overhead of index upgrade during a Solr upgrade when possible.

3) Requiring a Lucene API change is a potential blocker; I'd not be surprised if the Lucene project rejects making the "created-version" property writable, so such a discussion with them would come early.
> I agree. I am hopeful(!!) this will not be rejected though, since they can implement guardrails around changing the "created-version" property for added security. In my implementation I added the change in CommitInfos to check all the segments in a commit and ensure they are the new version in every aspect before setting the created-version property. This already happens in a synchronized block upon commit, so in my (limited) opinion, it should be safe. The API they give us can do all required internal validations and fail gracefully without any harm to the index. I can get a discussion started with the Lucene folks once we agree on the basics of this implementation. Or do you suggest I do that right away?

4) Obviously a new Solr API needs to play well with SolrCloud as well as other features such as shard split / move etc. Have you thought about locking / conflicts?
> SolrCloud challenges are not factored into the current implementation. But given that the process works at the core level and is agnostic of the mode, I am optimistic we can adapt the solution for SolrCloud through PR discussions. We might have to block certain operations like splitshard while this process is underway on a collection.

5) A reindex-collection API is probably wanted; however, it could be acceptable to implement a "core-level" API first and later add a "collection-level" API on top of it.
> Agreed.

6) Challenge the assumption that "in-place" segment level is the best choice for this feature. Re-indexing into a new collection due to major schema changes is also a common use case that this will not address.
> I would refer back to my answer to your second question in defense of the "in-place" implementation. Segment-level processing gives us the ability to restrict pollution of the index due to merges as we reindex, and also restartability. Agreed, this is not a substitute for when a field data type changes. This is intended to be a substitute for index upgrade when you upgrade Solr, so as to overcome the X --> X+1 --> X+2 version upgrade path limitation which exists today despite no schema changes. Of course, users are free to add new fields and should still be able to use this utility.
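As a rough sketch of how the core-admin action proposed above might be scripted, the snippet below just builds the request URL. Note that the UPGRADEINDEXES action is only a proposal in this thread; neither the action name nor its parameters exist in Solr today, so the endpoint shape here is an assumption.

```python
# Hypothetical invocation of the PROPOSED core-admin action discussed above.
# UPGRADEINDEXES does not exist in Solr yet; the action name and parameters
# are assumptions based on this thread's proposal.
from urllib.parse import urlencode

def upgrade_indexes_url(base_url, core, async_id=None):
    """Build the URL for the proposed UPGRADEINDEXES core-admin action."""
    params = {"action": "UPGRADEINDEXES", "core": core}
    if async_id is not None:
        # The proposed "async" option would let callers track the
        # long-running upgrade instead of blocking on the request.
        params["async"] = async_id
    return f"{base_url}/admin/cores?{urlencode(params)}"

print(upgrade_indexes_url("http://localhost:8983/solr", "mycore", async_id="upg-1"))
```

A shell script run after Solr startup could then issue this request with curl and poll for completion, matching the "hook the command into their shell/cmd scripts" workflow described above.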
[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941691#comment-17941691 ] Rahul Goswami edited comment on SOLR-17725 at 4/7/25 7:18 PM:
--
[~ab] For those running SolrCloud AND having enough capacity in terms of infrastructure and budget, the REINDEXCOLLECTION command is a good option. I see that it reindexes onto a parallel collection, so for clusters with hundreds/thousands of large indexes that cost can be substantial. Also, the source collection is put in read-only mode while the reindexing happens, which can be a point of contention in environments that are more update-heavy than search-heavy (e.g., for us at Commvault).

By means of this Jira I am attempting to overcome the Lucene limitation which forces you to reindex from source when you really don't HAVE to. At least I would like to offer that option to users who are more cost-sensitive or operationally sensitive (e.g., solutions which package Solr as part of the application and are installed/deployed on customer sites; it can be awkward to reason with customers as to why a solution upgrade may need downtime/additional infra capacity if it involves a Solr upgrade).

The proposed solution reindexes into the same core, can be easily adapted to work with both standalone Solr and SolrCloud, and allows both updates and searches to be served while doing so. This also helps remove additional operational overhead, since users can now focus on just the Solr upgrade without having to worry about index compatibility.
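The Lucene limitation referred to above can be summarized as a simple rule: an index opens only on the major version it was created on or the one immediately after it. The toy check below illustrates the rule; version numbers are illustrative and the real check inside Lucene is more involved than this.

```python
# Toy model of Lucene's backwards-compatibility rule discussed above:
# an index is readable only on the major version it was created on,
# or on the immediately following major version.
def index_readable(created_major, runtime_major):
    """True if an index created on `created_major` opens on `runtime_major`."""
    return runtime_major - 1 <= created_major <= runtime_major

# An 8.x-created index opens on 9.x but not on 10.x -- hence the
# X --> X+1 --> X+2 upgrade-path limitation this issue aims to remove.
assert index_readable(8, 9)
assert not index_readable(8, 10)
```

This is why, without an in-place rewrite of the segments (and of the "created-version" property), skipping a major version always forces a full reindex from source today.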
[jira] [Comment Edited] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17940666#comment-17940666 ] Jan Høydahl edited comment on SOLR-17725 at 4/3/25 11:48 AM:
-
Please clarify your intent with this Jira before continuing with any code contributions. While I think such a feature would benefit many Solr users, it would be sad to spend lots of time on a particular direction / implementation before higher-level questions / designs are clarified. As such, you did the correct thing starting a mailing list thread and a JIRA. My initial questions:
* Do you intend for this to be a new Solr API, if so what is the proposed API? Or a CLI utility tool to run on a cold index folder?
* Is one of your design goals to avoid the need for 2-3x disk space during the reindex, since you work on segment level and do merges?
* Requiring a Lucene API change is a potential blocker; I'd not be surprised if the Lucene project rejects making the "created-version" property writable, so such a discussion with them would come early.
* Obviously a new Solr API needs to play well with SolrCloud as well as other features such as shard split / move etc. Have you thought about locking / conflicts?
* A reindex-collection API is probably wanted; however, it could be acceptable to implement a "core-level" API first and later add a "collection-level" API on top of it.
* Challenge the assumption that "in-place" segment level is the best choice for this feature. Re-indexing into a new collection due to major schema changes is also a common use case that this will not address.
