[ https://issues.apache.org/jira/browse/SOLR-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941704#comment-17941704 ]
Rahul Goswami commented on SOLR-17725:
--------------------------------------
[~janhoy] Thanks for taking the time to review the JIRA. Please find my
thoughts on your questions below:
1) Do you intend for this to be a new Solr API (if so, what is the proposed
API?), or a CLI utility tool to run on a cold index folder?
> The implementation needs to run on a hot index for it to be lossless.
> Indexing calls happen using Solr APIs so Solr will need to be running. In our
> custom implementation I have hooked the process into SolrDispatchFilter
> load() so that the process can start upon server start for least operational
> overhead. As a generic solution I am thinking we can expose it as an action
> (/solr/admin/cores?action=UPGRADEINDEXES) with an "async" option for
> trackability. This way users can hook up the command into their shell/cmd
> scripts after Solr starts. Open to suggestions here.
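As an illustration of how such a call could be scripted after startup, here is a minimal Python sketch. Note that the UPGRADEINDEXES action, the "async" parameter, and the exact URL shape are all proposals from this comment, not an existing Solr API; the code only constructs the request URL.

```python
from urllib.parse import urlencode

def upgrade_indexes_url(base_url, core, async_id=None):
    """Build the PROPOSED core-admin call (hypothetical action name).

    async_id, if given, is a request id the caller could later use to
    poll for status, mirroring existing async core-admin semantics.
    """
    params = {"action": "UPGRADEINDEXES", "core": core}
    if async_id is not None:
        params["async"] = async_id
    return f"{base_url}/solr/admin/cores?{urlencode(params)}"

print(upgrade_indexes_url("http://localhost:8983", "mycore", async_id="upg-1"))
```

A shell script run after Solr starts would simply curl this URL and then poll with the same async id.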
2) Is one of your design goals to avoid the need for 2-3x disk space during the
reindex, since you work on segment level and do merges?
> Reducing infrastructure costs is a major design goal here. Also removing the
> operational overhead of index uprgade during Solr uprgade when possible.
3) Requiring a Lucene API change is a potential blocker; I'd not be surprised if
the Lucene project rejects making the "created-version" property writable, so
such a discussion with them would come early.
> I agree. I am hopeful(!!) this will not be rejected though since they can
> implement guardrails around changing the "created-version" property for added
> security. In my implementation I added the change in CommitInfos to check for
> all the segments in a commit and ensure they are the new version in every
> aspect before setting the created-version property. This already happens in a
> synchronized block so in my (limited) opinion, it should be safe. The API
> they give us can do all required internal validations and fail gracefully
> without any harm to the index. I can get a discussion started with the Lucene
> folks once we agree on the basics of this implementation. Or do you suggest I
> do that right away?
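The guardrail described above (only flip the created-version once every segment in the commit is fully on the new version) could look roughly like the following self-contained sketch. Segment, Commit, and try_advance_created_version are stand-in names for illustration, not Lucene classes or APIs:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    name: str
    version: int  # (major) version this segment was written with

@dataclass
class Commit:
    segments: List[Segment]
    created_version: int  # the index property being guarded

def try_advance_created_version(commit: Commit, target: int) -> bool:
    """Bump created-version only when every segment in the commit is
    already at (or beyond) the target version; otherwise fail safely
    without touching the index."""
    if any(seg.version < target for seg in commit.segments):
        return False  # refuse: index not fully rewritten yet
    commit.created_version = target
    return True
```

In the real thing this check would run inside the same synchronized block the comment mentions, so no segment can slip in between validation and the property write.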
4) Obviously a new Solr API needs to play well with SolrCloud as well as other
features such as shard split / move etc. Have you thought about locking /
conflicts?
> SolrCloud challenges are not factored into the current implementation. But
> given that the process works at the core level and is agnostic of the mode, I
> am optimistic we can adapt the solution for SolrCloud through PR discussions.
> We might have to block certain operations like splitshard while this process is
> underway on a collection.
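A toy model of the kind of mutual exclusion this would need between the upgrade and conflicting operations such as splitshard. The class and method names here are purely illustrative, not part of Solr:

```python
import threading

class UpgradeCoordinator:
    """Illustrative per-collection exclusion: while an in-place upgrade
    is running, conflicting operations (e.g. splitshard) are refused."""

    def __init__(self):
        self._lock = threading.Lock()
        self._upgrading = set()

    def begin_upgrade(self, collection: str) -> bool:
        with self._lock:
            if collection in self._upgrading:
                return False  # an upgrade is already running
            self._upgrading.add(collection)
            return True

    def can_split_shard(self, collection: str) -> bool:
        with self._lock:
            return collection not in self._upgrading

    def end_upgrade(self, collection: str) -> None:
        with self._lock:
            self._upgrading.discard(collection)
```

In SolrCloud the state would live in ZooKeeper rather than in-process, but the invariant is the same.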
5) A reindex-collection API is probably wanted, however it could be acceptable
to implement a "core-level" API first and later add a "collection-level" API on
top of it.
> Agreed
6) Challenge the assumption that "in-place" segment level is the best choice
for this feature. Re-indexing into a new collection due to major schema changes
is also a common use case that this will not address.
> I would refer back to my answer to your second question in defense of the
> "in-place" implementation. Segment-level processing gives us the ability to
> restrict pollution of the index by merges as we reindex, and also gives us
> restartability.
> Agreed this is not a substitute for when a field data type changes. This is
> intended to be a substitute for index upgrade when you upgrade Solr so as to
> overcome the X --> X+1 --> X+2 version upgrade path limitation which exists
> today despite no schema changes. Of course, users are free to add new fields
> and should still be able to use this utility.
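The segment-level loop with restartability could be sketched as follows. All names are illustrative; the actual reindexing of one segment (re-adding its documents from stored/docValues fields, then dropping the old segment) is abstracted behind a callback, and the `done` set stands in for whatever durable progress record the real process would keep:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Set

@dataclass
class Segment:
    name: str
    version: int

def upgrade_in_place(segments: Iterable[Segment],
                     target_version: int,
                     reindex_segment: Callable[[Segment], None],
                     done: Optional[Set[str]] = None) -> Set[str]:
    """Walk old-version segments one at a time. `done` records finished
    segment names so the process can resume after a restart instead of
    redoing work (the restartability benefit claimed above)."""
    done = set() if done is None else done
    for seg in segments:
        if seg.version >= target_version or seg.name in done:
            continue  # already on the new format, or finished in a prior run
        reindex_segment(seg)
        done.add(seg.name)
    return done
```

Processing one segment at a time is also what keeps the extra disk footprint near one segment's worth rather than a full index copy.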
> Automatically upgrade Solr indexes without needing to reindex from source
> -------------------------------------------------------------------------
>
> Key: SOLR-17725
> URL: https://issues.apache.org/jira/browse/SOLR-17725
> Project: Solr
> Issue Type: Improvement
> Reporter: Rahul Goswami
> Priority: Major
> Attachments: High Level Design.png
>
>
> Today upgrading from Solr version X to X+2 requires complete reingestion of
> data from source. This comes from Lucene's constraint which only guarantees
> index compatibility between the version the index was created in and the
> immediate next version.
> This reindexing usually comes with added downtime and/or cost. Especially in
> case of deployments which are in customer environments and not completely in
> control of the vendor, this proposition of having to completely reindex the
> data can become a hard sell.
> I, on behalf of my employer, Commvault, have developed a way which achieves
> this reindexing in-place on the same index. Also, the process automatically
> keeps "upgrading" the indexes over multiple subsequent Solr upgrades without
> needing manual intervention.
> It comes with the following limitations:
> i) All _source_ fields need to be either stored=true or docValues=true. Any
> copyField destination fields can of course be stored=false; it is only the
> source fields (or more precisely, the source fields you care about
> preserving) that need to be stored=true or docValues=true.
> ii) The datatype of an existing field in schema.xml shouldn't change upon
> Solr upgrade. Introducing new fields is fine.
> For indexes where this limitation is not a problem (it wasn't for us!), the
> tool can reindex in-place on the same core with zero downtime and
> legitimately "upgrade" the index. This can remove a lot of operational
> headaches, especially in environments with hundreds/thousands of very large
> indexes.
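Limitation (i) above is easy to validate up front, before starting an upgrade. A hedged sketch of such a check, with the schema modeled as a plain name-to-flags mapping rather than a real schema.xml parse:

```python
def ineligible_source_fields(schema):
    """Return source fields that violate limitation (i): neither
    stored=true nor docValues=true, so their contents could not be
    recovered for in-place reindexing."""
    return [name for name, flags in schema.items()
            if not (flags.get("stored") or flags.get("docValues"))]
```

A real tool would read these flags from the core's schema and refuse to start (or warn) if the list is non-empty.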
--
This message was sent by Atlassian Jira
(v8.20.10#820010)