Thinking more about this issue, I wonder if we can avoid resetting and
rebuilding everything from scratch, and instead, let the upgrade
happen in the background, while still serving the existing view data.

The realization was that collation doesn't affect the emitted keys and
values themselves, only their order in the view b-trees. That means
we'd just have to rebuild b-trees, and that is exactly what our view
compactor already does.
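
To illustrate the point, here is a toy Python sketch (not CouchDB
code). The two "collations" are made-up stand-ins for an old and a new
libicu, and a sorted list stands in for the view b-tree. It only shows
that the rows stay the same while their order changes, and that
re-sorting the same rows is enough to make range lookups correct again.

```python
from bisect import bisect_left, bisect_right

# Two toy collations standing in for "old" and "new" libicu behaviour.
# These are illustrative assumptions, not real ICU rules.
def old_collate(key):
    return key.lower()   # case-insensitive order

def new_collate(key):
    return key           # plain code point order

def range_query(rows, start, end, collate):
    """Binary search over rows, assuming they are sorted by collate."""
    sort_keys = [collate(r) for r in rows]
    lo = bisect_left(sort_keys, collate(start))
    hi = bisect_right(sort_keys, collate(end))
    return rows[lo:hi]

emitted = ["carol", "Bob", "eve", "alice", "Dave"]

# "View b-tree" built while the old collator was installed.
old_tree = sorted(emitted, key=old_collate)

# After the library upgrade, the stale order is searched with the
# new collator, which breaks the binary search assumption.
stale_result = range_query(old_tree, "Bob", "Dave", new_collate)

# "Compaction": the same rows, re-sorted with the new collator.
rebuilt_tree = sorted(old_tree, key=new_collate)
fresh_result = range_query(rebuilt_tree, "Bob", "Dave", new_collate)

print(stale_result)
print(fresh_result)
```

In this toy case the stale lookup misses "Dave", while the lookup on
the re-sorted rows returns both "Bob" and "Dave".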

When we detect a libicu version discrepancy, we'd submit the view for
compaction. We even have a dedicated "upgrade" [1] channel in smoosh
which handles file format version upgrades; we'd tweak that logic to
trigger on libicu version mismatches as well.
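
Roughly, the check could look like the Python sketch below. This is
only an illustration of the idea, not the smoosh API: ViewHeader,
needs_collation_upgrade, enqueue_upgrades and the version tuples are
all hypothetical names and values. It also assumes that a view with no
recorded collator version is treated as mismatched, which is a choice
we would still have to make.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Version = Tuple[int, ...]

@dataclass
class ViewHeader:
    # Hypothetical view header; collator_version is None for views
    # written before the version was recorded.
    signature: str
    collator_version: Optional[Version] = None

def needs_collation_upgrade(header: ViewHeader, current: Version) -> bool:
    # Assumption: an unknown (None) version counts as a mismatch.
    return header.collator_version != current

def enqueue_upgrades(headers: List[ViewHeader], current: Version,
                     upgrade_channel: List[str]) -> List[str]:
    # Submit every mismatched view to the "upgrade" channel, modelled
    # here as a plain list of view signatures.
    for header in headers:
        if needs_collation_upgrade(header, current):
            upgrade_channel.append(header.signature)
    return upgrade_channel

current_version = (2, 0)   # made-up version of the installed collator
views = [
    ViewHeader("view_a", (2, 0)),   # up to date, left alone
    ViewHeader("view_b", (1, 0)),   # built with an older collator
    ViewHeader("view_c"),           # no version recorded
]
queue = enqueue_upgrades(views, current_version, [])
print(queue)
```

The existing view files keep serving requests while the queued views
are compacted in the background.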

Would this work? Does anyone see any issue with that approach?

[1] 
https://github.com/apache/couchdb/blob/3.x/src/smoosh/src/smoosh_server.erl#L435-L442

Cheers,
-Nick



On Fri, Oct 29, 2021 at 7:01 PM Nick Vatamaniuc <vatam...@apache.org> wrote:
>
> Hello everyone,
>
> CouchDB by default uses the libicu library to sort its view rows.
> When views are built, we do not record or track the version of the
> collation algorithm. The issue is that the ICU library may modify the
> collation order between major libicu versions, and when that happens,
> views built with the older versions may experience data loss. I wanted
> to discuss the option to record the libicu collator version in each
> view and then warn the user when there is a mismatch, with the option
> to either ignore the mismatch or automatically rebuild the views.
>
> Imagine, for example, searching patient records using start/end keys.
> It could be possible that, say, the first letter of their name now
> collates differently in a new libicu. That would prevent the patient
> record from showing up in the view results for some important
> procedure or medication. Users might not even be aware of this kind of
> data loss occurring: there won't be any error in the API or warning in
> the logs.
>
> I was thinking about how to solve this. There were a few commits
> already to clean up our collation drivers [1], expose the libicu and
> collation algorithm versions in the new _versions endpoint [2], and
> some other minor fixes in that area. As the next steps we could:
>
>   1) Modify our views to keep track of the collation algorithm
> version. We could attempt to transparently upgrade the view header
> format: read the old view file, add a libicu collation version field
> to the header (which changes the signature), and then save the file
> with the new header and new signature. This avoids view rebuilds; it
> just records the collator version in the view and moves the files to
> a new name.
>
>   2) Do what PostgreSQL does: 2a) emit a warning with the view
> results when the current libicu version doesn't match the version in
> the view [3]. That means altering the view results to add a "warning":
> "..." field. An alternative, 2b), is to emit a warning in
> _design/$ddoc/_info only. Users would then have to know to check
> _design/$ddoc/_info for each ddoc in each db after an OS version
> upgrade or after restoring backups. Of course, there may be users who
> use the "raw" collation option, or who know they are using just the
> plain ASCII character set in their views. So we'd have a
> configuration setting to ignore the warnings as well.
>
>   3) Users who see the warning could then either rebuild the view
> with the new collator library manually, or it could happen
> automatically based on a configuration option, basically "when
> collator versions are mismatched, invalidate and rebuild all the
> views".
>
>   4) We'd have a way for the users to assert (POST a ddoc update) that
> they double-checked the new ICU version and are convinced that a
> particular view would not experience data loss with the new collator.
> That should make the warning go away and keep the view from being rebuilt.
> This can't be just a naive "collator" option setting as both per-view
> and per-design options are used when computing the view signature, and
> any changes there would result in the view being rebuilt. Perhaps we
> can add it to the design docs as a separate option which is excluded
> from the signature hash, like the "autoupdate" setting for the
> background index builder ("collation_version_accept"?). PostgreSQL also
> offers this option with the ALTER COLLATION ... REFRESH VERSION command [3].
>
> What do we think, is this a reasonable approach? Is there something
> easier / simpler we can do?
>
> Thanks!
> -Nick
>
> [1] 
> https://github.com/apache/couchdb/pull/3746/commits/28f26f52fe2e170d98658311dafa8198d96b8061
> [2] 
> https://github.com/apache/couchdb/commit/c1bb4e4856edd93255d75ebe158b4da38bbf3333
> [3] https://www.postgresql.org/docs/13/sql-altercollation.html
