Thanks, Adam. And thanks for the tip about the view header, Bob. Wonder if a disk version would make sense for views. Noticed Eric did a nice job transparently migrating 2.x -> 3.x view files when we removed key seq indices. Perhaps something like that would work for adding a collator version.
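A rough sketch of that idea, in Python for brevity rather than Erlang. Every name here (`ViewHeader`, `upgrade_header`, `needs_rebuild`, the disk version numbers and the collator version strings) is hypothetical and not taken from the CouchDB source. Stamping legacy headers with the running collator version is an assumption borrowed from option 1 of the original proposal quoted below.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Hypothetical: the first header format that carries a collator version.
CURRENT_DISK_VERSION = 2


@dataclass(frozen=True)
class ViewHeader:
    """Toy stand-in for a view file header (not the real on-disk layout)."""
    disk_version: int
    collator_version: Optional[str] = None


def upgrade_header(header: ViewHeader, running_collator: str) -> ViewHeader:
    """Return the header in the current format.

    Assumption: a legacy header has no collator version, so it is stamped
    with the version of the collator that is running right now.
    """
    if header.disk_version >= CURRENT_DISK_VERSION:
        return header
    return replace(
        header,
        disk_version=CURRENT_DISK_VERSION,
        collator_version=running_collator,
    )


def needs_rebuild(header: ViewHeader, running_collator: str) -> bool:
    """True when the view was built with a different collator version."""
    recorded = header.collator_version
    return recorded is not None and recorded != running_collator


legacy = ViewHeader(disk_version=1)
upgraded = upgrade_header(legacy, running_collator="153.112")
```

With a version recorded in the header, a mismatch check like `needs_rebuild` could be what queues the view for compaction in the background.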
Cheers,
-Nick

On Fri, Nov 19, 2021 at 9:09 AM Adam Kocoloski <kocol...@apache.org> wrote:
>
> That seems like a smart solution Nick.
>
> Adam
>
> > On Nov 19, 2021, at 7:28 AM, Robert Newson <b...@rsn.io> wrote:
> >
> > Noting that the upgrade channel for views was misconceived (by me) as there
> > is no version number in the header for them. You’d need to add it.
> >
> > B.
> >
> >> On 18 Nov 2021, at 07:12, Nick Vatamaniuc <vatam...@gmail.com> wrote:
> >>
> >> Thinking more about this issue I wonder if we can avoid resetting and
> >> rebuilding everything from scratch, and instead, let the upgrade
> >> happen in the background, while still serving the existing view data.
> >>
> >> The realization was that collation doesn't affect the emitted keys and
> >> values themselves, only their order in the view b-trees. That means
> >> we'd just have to rebuild b-trees, and that is exactly what our view
> >> compactor already does.
> >>
> >> When we detect a libicu version discrepancy we'd submit the view for
> >> compaction. We even have a dedicated "upgrade" [1] channel in smoosh
> >> which handles file version format upgrades, but we'll tweak that logic
> >> to trigger on libicu version mismatches as well.
> >>
> >> Would this work? Does anyone see any issue with that approach?
> >>
> >> [1]
> >> https://github.com/apache/couchdb/blob/3.x/src/smoosh/src/smoosh_server.erl#L435-L442
> >>
> >> Cheers,
> >> -Nick
> >>
> >>> On Fri, Oct 29, 2021 at 7:01 PM Nick Vatamaniuc <vatam...@apache.org>
> >>> wrote:
> >>>
> >>> Hello everyone,
> >>>
> >>> CouchDB by default uses the libicu library to sort its view rows.
> >>> When views are built, we do not record or track the version of the
> >>> collation algorithm. The issue is that the ICU library may modify the
> >>> collation order between major libicu versions, and when that happens,
> >>> views built with the older versions may experience data loss. I wanted
> >>> to discuss the option to record the libicu collator version in each
> >>> view then warn the user when there is a mismatch. Also, optionally
> >>> ignore the mismatch, or automatically rebuild the views.
> >>>
> >>> Imagine, for example, searching patient records using start/end keys.
> >>> It could be possible that, say, the first letter of their name now
> >>> collates differently in a new libicu. That would prevent the patient
> >>> record from showing up in the view results for some important
> >>> procedure or medication. Users might not even be aware of this kind of
> >>> data loss occurring: there won't be any error in the API or warning in
> >>> the logs.
> >>>
> >>> I was thinking how to solve this. There were a few commits already to
> >>> clean up our collation drivers [1], expose libicu and collation
> >>> algorithm version in the new _versions endpoint [2], and some other
> >>> minor fixes in that area. As the next steps we could:
> >>>
> >>> 1) Modify our views to keep track of the collation algorithm
> >>> version. We could attempt to transparently upgrade the view header
> >>> format -- read the old view file, update the header with an extra
> >>> libicu collation version field, that updates the signature, and then,
> >>> save the file with the new header and new signature. This avoids view
> >>> rebuilds, just records the collator version in the view and moves the
> >>> files to a new name.
> >>>
> >>> 2) Do what PostgreSQL does, and 2a) emit a warning with the view
> >>> results when the current libicu version doesn't match the version in
> >>> the view [3]. That means altering the view results to add a "warning":
> >>> "..." field. Another alternative 2b) is to emit a warning in the
> >>> _design/$ddoc/_info only. Users would have to know that after an OS
> >>> version upgrade, or restoring backups, to make sure to look at their
> >>> _design/$ddoc/_info for each db for each ddoc. Of course, there may be
> >>> users who used the "raw" collation option, or know they are using
> >>> just the plain ASCII character sets in their views. So we'd have a
> >>> configuration setting to ignore the warnings as well.
> >>>
> >>> 3) Users who see the warning, could then either rebuild the view
> >>> with the new collator library manually, or it could happen
> >>> automatically based on a configuration option, basically "when
> >>> collator versions are mismatched, invalidate and rebuild all the
> >>> views".
> >>>
> >>> 4) We'd have a way for the users to assert (POST a ddoc update) that
> >>> they double-checked the new ICU version and are convinced that a
> >>> particular view would not experience data loss with the new collator.
> >>> That should make the warning go away, and the view to not be rebuilt.
> >>> This can't be just a naive "collator" option setting as both per-view
> >>> and per-design options are used when computing the view signature, and
> >>> any changes there would result in the view being rebuilt. Perhaps we
> >>> can add it to the design docs as a separate option which is excluded
> >>> from the signature hash, like the "autoupdate" setting for background
> >>> index builder ("collation_version_accept"?). PostgreSQL also offers
> >>> this option with the ALTER COLLATION ... REFRESH VERSION command [3].
> >>>
> >>> What do we think, is this a reasonable approach? Is there something
> >>> easier / simpler we can do?
> >>>
> >>> Thanks!
> >>> -Nick
> >>>
> >>> [1]
> >>> https://github.com/apache/couchdb/pull/3746/commits/28f26f52fe2e170d98658311dafa8198d96b8061
> >>> [2]
> >>> https://github.com/apache/couchdb/commit/c1bb4e4856edd93255d75ebe158b4da38bbf3333
> >>> [3] https://www.postgresql.org/docs/13/sql-altercollation.html
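
To make the start/end key example from the original proposal concrete, here is a toy Python sketch. It is not CouchDB code: a sorted list stands in for the view b-tree, and the two "collations" (code point order and case-insensitive order) are made-up stand-ins for two libicu versions. It shows a range lookup silently missing a row when the stored order and the lookup order disagree, and the same rows being found again once they are re-sorted, which is all a rebuild of the b-tree would need to do.

```python
import bisect


def range_query(stored_rows, sort_key, start, end):
    """Return rows with start <= key <= end using binary search.

    Like a b-tree lookup, this trusts that stored_rows is already
    ordered by sort_key and never scans the whole list.
    """
    keys = [sort_key(row) for row in stored_rows]
    lo = bisect.bisect_left(keys, sort_key(start))
    hi = bisect.bisect_right(keys, sort_key(end))
    return stored_rows[lo:hi]


def old_collation(s):
    # Stand-in for the old collator: plain code point order.
    return s


def new_collation(s):
    # Stand-in for the new collator: case-insensitive order.
    return s.casefold()


names = ["adam", "Zoe", "bob"]

# The view was built while the old collator was in use.
stored = sorted(names, key=old_collation)

# After the upgrade, lookups compare keys with the new collator.
stale = range_query(stored, new_collation, "z", "zz")

# Re-sorting the same rows with the new collator fixes the lookup.
rebuilt = sorted(stored, key=new_collation)
fresh = range_query(rebuilt, new_collation, "z", "zz")

print(stale, fresh)
```

The stale lookup raises no error and logs nothing; it simply returns fewer rows than it should.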