[jira] [Commented] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17949234#comment-17949234 ] Rahul Goswami commented on SOLR-17725:

Submitted a pull request for the Lucene API change. Fingers crossed! [https://github.com/apache/lucene/pull/14607]

> Automatically upgrade Solr indexes without needing to reindex from source
> --------------------------------------------------------------------------
>
>          Key: SOLR-17725
>          URL: https://issues.apache.org/jira/browse/SOLR-17725
>      Project: Solr
>   Issue Type: Improvement
>     Reporter: Rahul Goswami
>     Priority: Major
>  Attachments: High Level Design.png
>
> Today, upgrading from Solr version X to X+2 requires complete reingestion of data from source. This stems from Lucene's constraint of only guaranteeing index compatibility between the version the index was created in and the immediately following major version.
> This reindexing usually comes with added downtime and/or cost. Especially for deployments that live in customer environments and are not completely under the vendor's control, having to completely reindex the data can become a hard sell.
> I, on behalf of my employer, Commvault, have developed a way to perform this reindexing in place on the same index. The process also automatically keeps "upgrading" the indexes over multiple subsequent Solr upgrades without needing manual intervention.
> It comes with the following limitations:
> i) All _source_ fields need to be either stored=true or docValues=true. Any copyField destination fields can of course be stored=false; it is only the source fields (or, more precisely, the source fields you care about preserving) that must have stored or docValues set to true.
> ii) The datatype of an existing field in schema.xml shouldn't change upon Solr upgrade. Introducing new fields is fine.
> For indexes where these limitations are not a problem (they weren't for us!), the tool can reindex in place on the same core with zero downtime and legitimately "upgrade" the index. This can remove a lot of operational headaches, especially in environments with hundreds or thousands of very large indexes.
[jira] [Commented] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17947233#comment-17947233 ] Rahul Goswami commented on SOLR-17725:

Requested the API from Lucene a few days back, and the discussion is underway at [https://lists.apache.org/thread/gk3kwplon73llz356szz1mn3myn3nnm3]. I was trying to avoid cross-posting, but now I'm thinking it might be OK to copy d...@solr.apache.org on the discussion(?)
[jira] [Commented] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17944396#comment-17944396 ] Rahul Goswami commented on SOLR-17725:

Will do [~dsmiley], thanks.

[~gus] As far as I can see, the current implementation doesn't run the risk of corruption. The status is maintained in two ways:

1) At the core level -> to keep track of which core was being processed when the service went down or was killed. A file autoupgrade_status.csv is maintained; it is written each time a core is picked up for processing, and a status is set for that core. Each time the process resumes, it picks up the core with status "REINDEXING_ACTIVE", if any. For SolrCloud, this file can be housed in ZooKeeper. This is an implementation detail I am happy to discuss further, but in our (Commvault's) implementation we recognize the following statuses: DEFAULT, REINDEXING_ACTIVE, REINDEXING_PAUSED, PROCESSED, ERROR, CORRECTVERSION.

2) At the segment level -> This is where we piggyback on Lucene's design, and it's beautiful! As we iterate over each segment, we read the live docs out of the segment, create a SolrInputDocument from each, and reindex it using Solr's API (a sketch follows this comment). This achieves two things:
i) A reindexed doc marks the existing (old) doc as deleted (once auto-commit kicks in). This way, if the service goes down, we don't need to reprocess the docs that were already reindexed and committed. And if the service goes down before a commit could be processed, the small penalty is reprocessing the docs of only that one segment.
ii) When a segment is fully processed, Lucene's DeletionPolicy deletes it, reclaiming space in the process. Hence we never process the same segment again.

Note that as we do this, we are in no way interfering with Lucene's index structure directly; we interact only through APIs. A combination of these factors maintains continuity in the processing of a core despite failures, without running the risk of corruption.
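A minimal sketch of the per-segment loop described in point 2), under some assumptions: the actual implementation runs inside Solr against the live core, whereas this standalone version opens the index directory read-only and pushes documents back through a SolrClient; it also assumes all source fields are stored (docValues-only fields would need a separate DocValues read path). The index path and core URL are illustrative.

```java
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Bits;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SegmentReindexSketch {
  public static void main(String[] args) throws Exception {
    try (DirectoryReader reader =
             DirectoryReader.open(FSDirectory.open(Paths.get("/var/solr/data/core1/data/index")));
         SolrClient solr = new Http2SolrClient.Builder("http://localhost:8983/solr/core1").build()) {

      for (LeafReaderContext leaf : reader.leaves()) {            // one leaf == one segment
        Bits liveDocs = leaf.reader().getLiveDocs();              // null => segment has no deletions
        List<SolrInputDocument> batch = new ArrayList<>();

        for (int docId = 0; docId < leaf.reader().maxDoc(); docId++) {
          if (liveDocs != null && !liveDocs.get(docId)) continue; // skip already-deleted docs
          Document stored = leaf.reader().document(docId);        // stored fields of the live doc
          SolrInputDocument sdoc = new SolrInputDocument();
          for (IndexableField f : stored) {
            // Simplification: binary stored fields and docValues-only fields need extra handling.
            Object value = f.numericValue() != null ? f.numericValue() : f.stringValue();
            sdoc.addField(f.name(), value);
          }
          batch.add(sdoc);
        }

        if (!batch.isEmpty()) {
          solr.add(batch);   // re-adding by uniqueKey marks the old copies as deleted
          solr.commit();     // after the commit, the deletion policy can drop the old segment,
                             // so a restart never reprocesses a fully handled segment
        }
      }
    }
  }
}
```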
[jira] [Commented] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17943925#comment-17943925 ] Gus Heck commented on SOLR-17725:

I asked this in the user list thread but didn't see an answer (sorry if I missed it). As ab also noted, we need to understand what happens if a node fails partway through the process (i.e. someone kill -9's it, or nobody saw the email from Amazon that the hardware underlying the VPC instance needs to be rebooted, etc.). How does the process resume where it left off, or roll back, to prevent a corrupted index?
[jira] [Commented] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17943848#comment-17943848 ] David Smiley commented on SOLR-17725:

Start a thread, probably on [d...@lucene.apache.org|mailto:d...@lucene.apache.org], to express the API change you would like. Avoid discussing the Solr particulars; Lucene is more foundational. I suspect someone will veto, but it's worth asking anyway.
[jira] [Commented] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17943738#comment-17943738 ] Rahul Goswami commented on SOLR-17725:

[~janhoy] How do you recommend we proceed here? If you need me to elaborate on any part of the design, I am happy to do so (either here, over a video chat, or whatever the norm is for a new feature). If we need a wider audience to take a look at this, I am also happy to float it on the dev list.
[jira] [Commented] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941853#comment-17941853 ] Jan Høydahl commented on SOLR-17725:

Thanks [~ab] for reminding us about the REINDEXCOLLECTION API. I'm not sure about its internals, but there is no equivalent REINDEXCORE API, probably because we want to be able to place the new replicas on any node in the cluster, while a /admin/cores?action=REINDEXCORE would be local. Still, adding a REINDEXCORE option for standalone users would solve the upgrade problem at the core level, with slightly more temporary disk space required than with the segment approach.
[jira] [Commented] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941691#comment-17941691 ] Rahul Goswami commented on SOLR-17725:

[~ab] For those running SolrCloud AND having enough capacity in terms of infrastructure and budget, the REINDEXCOLLECTION command is a good option. I see that it reindexes into a parallel collection, so for clusters with hundreds/thousands of large indexes that cost can be substantial. Also, the source collection is put in read-only mode while the reindexing happens, so it can be a point of contention in environments which are more update-heavy than search-heavy (e.g., for us at Commvault).

By means of this Jira I am attempting to overcome the Lucene limitation which forces you to reindex from source when you really don't HAVE to. At the very least I would like to offer that option to users who are more cost-sensitive or operationally sensitive (e.g., solutions which package Solr as part of the application and are installed/deployed at customer sites; it can be awkward to reason with customers about why a solution upgrade may need downtime just because it involves a Solr upgrade). The proposed solution reindexes into the same core, can be easily adapted to work with both standalone Solr and SolrCloud, and allows both updates and searches to be served while doing so. This also removes additional operational overhead, since users can focus on just the Solr upgrade without having to worry about index compatibility.
[jira] [Commented] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941704#comment-17941704 ] Rahul Goswami commented on SOLR-17725:

[~janhoy] Thanks for taking the time to review the Jira. Please find my thoughts on your questions below:

1) Do you intend for this to be a new Solr API, and if so, what is the proposed API? Or a CLI utility tool to run on a cold index folder?
> The implementation needs to run on a hot index for it to be lossless. Indexing calls happen through Solr APIs, so Solr will need to be running. In our custom implementation I hooked the process into SolrDispatchFilter load() so that it starts on server start, for the least operational overhead. As a generic solution, I am thinking we can expose it as a core admin action (/solr/admin/cores?action=UPGRADEINDEXES) with an "async" option for trackability (a sketch of such a call follows this comment). This way users can hook the command into their shell/cmd scripts after Solr starts. Open to suggestions here.

2) Is one of your design goals to avoid the need for 2-3x disk space during the reindex, since you work on the segment level and do merges?
> Reducing infrastructure costs is a major design goal here, as is removing the operational overhead of an index upgrade during a Solr upgrade where possible.

3) Requiring a Lucene API change is a potential blocker; I'd not be surprised if the Lucene project rejects making the "created-version" property writable, so such a discussion with them should come early.
> I agree. I am hopeful(!!) this will not be rejected, though, since they can implement guardrails around changing the "created-version" property for added safety. In my implementation I added the change in CommitInfos to check all the segments in a commit and ensure they are the new version in every respect before setting the created-version property. This already happens in a synchronized block, so in my (limited) opinion it should be safe. The API they give us can do all required internal validations and fail gracefully without any harm to the index. I can get a discussion started with the Lucene folks once we agree on the basics of this implementation. Or do you suggest I do that right away?

4) Obviously a new Solr API needs to play well with SolrCloud as well as other features such as shard split / move etc. Have you thought about locking / conflicts?
> SolrCloud challenges are not factored into the current implementation. But given that the process works at the core level and is agnostic of the mode, I am optimistic we can adapt the solution for SolrCloud through PR discussions. We might have to block certain operations, like shard split, while this process is underway on a collection.

5) A reindex-collection API is probably wanted; however, it could be acceptable to implement a "core-level" API first and later add a "collection-level" API on top of it.
> Agreed.

6) Challenge the assumption that "in-place" at the segment level is the best choice for this feature. Reindexing into a new collection due to major schema changes is also a common use case that this will not address.
> I would refer back to my answer to your second question in defense of the "in-place" implementation. Segment-level processing gives us the ability to limit pollution of the index due to merges as we reindex, and also restartability. Agreed, this is not a substitute for when a field's data type changes. It is intended as a substitute for the index upgrade you otherwise need when you upgrade Solr, so as to overcome the X -> X+1 -> X+2 version upgrade path limitation which exists today even when there are no schema changes. Of course, users are free to add new fields and should still be able to use this utility.
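The UPGRADEINDEXES core admin action mentioned in answer 1 is only a proposal; no such action exists in Solr today. Purely as an illustration of how a client might trigger it if it were added, here is a hedged sketch reusing SolrJ's generic request plumbing; the action name, parameters, and request id are all hypothetical.

```java
// Hypothetical invocation of the *proposed* UPGRADEINDEXES core admin action.
// The action name and its parameters do not exist in Solr today; only the
// GenericSolrRequest plumbing shown here is real SolrJ API.
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class UpgradeIndexesSketch {
  public static void main(String[] args) throws Exception {
    try (SolrClient solr = new Http2SolrClient.Builder("http://localhost:8983/solr").build()) {
      ModifiableSolrParams params = new ModifiableSolrParams();
      params.set("action", "UPGRADEINDEXES");  // proposed action, not yet part of Solr
      params.set("core", "core1");             // hypothetical: the core to upgrade in place
      params.set("async", "upgrade-core1");    // hypothetical: request id for later status polling

      GenericSolrRequest req =
          new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/cores", params);
      System.out.println(req.process(solr).getResponse());
    }
  }
}
```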
[jira] [Commented] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941033#comment-17941033 ] Andrzej Bialecki commented on SOLR-17725:

The already existing {{REINDEXCOLLECTION}} command supports a similar use case and has similar requirements for the source fields needed to re-create the original {{SolrInputDocument}}s. It doesn't work in place, but depending on the use case this can be viewed as a feature rather than a bug ;) because it's easier to preserve the original data in case of a catastrophic failure.
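For reference, REINDEXCOLLECTION is part of the existing Collections API; a minimal SolrJ invocation might look like the sketch below. The collection names and the removeSource choice are illustrative, and the parameter names are taken from the Collections API documentation for recent releases, so double-check them against the Reference Guide for your version.

```java
// Minimal sketch of calling the existing REINDEXCOLLECTION command through SolrJ.
// Collection names and option values are illustrative.
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class ReindexCollectionSketch {
  public static void main(String[] args) throws Exception {
    try (SolrClient solr = new Http2SolrClient.Builder("http://localhost:8983/solr").build()) {
      ModifiableSolrParams params = new ModifiableSolrParams();
      params.set("action", "REINDEXCOLLECTION");
      params.set("name", "mycollection");       // source collection; it goes read-only while this runs
      params.set("target", "mycollection_v2");  // documents are reindexed into a parallel collection
      params.set("removeSource", "false");      // keep the original data until the result is verified

      GenericSolrRequest req =
          new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/collections", params);
      System.out.println(req.process(solr).getResponse());
    }
  }
}
```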
[jira] [Commented] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17940666#comment-17940666 ] Jan Høydahl commented on SOLR-17725:

Please clarify your intent with this Jira before continuing with any code contributions. While I think such a feature would benefit many Solr users, it would be sad to spend lots of time on a particular direction/implementation before the higher-level questions/designs are clarified. As such, you did the correct thing starting a mailing list thread and a Jira. My initial questions:
* Do you intend for this to be a new Solr API, and if so, what is the proposed API? Or a CLI utility tool to run on a cold index folder?
* Is one of your design goals to avoid the need for 2-3x disk space during the reindex, since you work on the segment level and do merges?
* Requiring a Lucene API change is a potential blocker; I'd not be surprised if the Lucene project rejects making the "created-version" property writable, so such a discussion with them should come early.
* Obviously a new Solr API needs to play well with SolrCloud as well as other features such as shard split / move etc. It could however be acceptable to implement a "core-level" API first and later a "cluster-level" one on top of it.
* Challenge the assumption that "in-place" at the segment level is the best choice for this feature. Reindexing into a new collection due to major schema changes is also a common use case that this will not address.
[jira] [Commented] (SOLR-17725) Automatically upgrade Solr indexes without needing to reindex from source
[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17940243#comment-17940243 ] Rahul Goswami commented on SOLR-17725:

The attached document outlines an example where the upgrade tool works on an index originally created in Solr 7.x, AFTER an upgrade to Solr 8.x. Key points:

1) Lucene version X can read an index created in version X-1. Writing of new segments happens with the latest version's codec.
2) When a segment merge happens, the merged segment carries a version stamp, "minVersion", which is the lowest version among the segments participating in the merge.
3) The segments_* file in a Lucene index records the Lucene version in which the index was first created.

The design doc outlines the process of converting all segments to the new version. It's sort of a pull model: you first upgrade Solr and then "pull" the index up to the current version. By the end of the process outlined in the doc, all segments have been converted to the new version and the index is in all respects an "upgraded" index. The only missing piece is updating the index creation version in the commit point. I did this by exposing a method in Lucene's CommitInfos which validates the version of all segments and updates the creation-version stamp in the commit point (we might need to request an API from Lucene here). When this index is later opened in Solr 9.x, Solr can read it (thanks to point #1) and the same process repeats to make the index ready for Solr 10.x.
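To make points 1-3 concrete, here is a small read-only sketch that inspects an index's commit metadata with Lucene's SegmentInfos (which I take to be what the comment above calls CommitInfos): it prints the major version the index was created in and each segment's written/min versions, and checks whether every segment is already on the current major version. The step of actually advancing the created-version is deliberately absent, since Lucene exposes no such setter today; that is exactly the API this issue requests. The index path is illustrative.

```java
// Inspects an index's commit metadata: the version the index was created in
// (point 3) and each segment's written/min versions (points 1 and 2). Lucene
// currently offers no public way to change the created-version, which is the
// API change this issue asks for.
import java.nio.file.Paths;

import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexVersionInspector {
  public static void main(String[] args) throws Exception {
    try (Directory dir = FSDirectory.open(Paths.get("/var/solr/data/core1/data/index"))) {
      SegmentInfos infos = SegmentInfos.readLatestCommit(dir);

      // Major version the index was originally created in -- this is what keeps an old
      // index unreadable two major versions later, even once every segment has been rewritten.
      System.out.println("index created in major version: " + infos.getIndexCreatedVersionMajor());

      boolean allSegmentsCurrent = true;
      for (SegmentCommitInfo sci : infos) {
        Version written = sci.info.getVersion();   // version that wrote this segment
        Version min = sci.info.getMinVersion();    // oldest version of any segment merged into it
        System.out.println(sci.info.name + " written=" + written + " min=" + min);
        if (min == null || min.major < Version.LATEST.major) {
          allSegmentsCurrent = false;              // still carries data from an older major version
        }
      }
      // Only once every segment is current could the proposed Lucene API safely advance
      // the created-version stamp in the commit point.
      System.out.println("all segments on current major version: " + allSegmentsCurrent);
    }
  }
}
```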