Hi Nick,

The view _info setup looks good to me. Maybe it would be helpful to
print the current runtime's collator and ICU versions somewhere like
the / meta or /node/_system endpoint? That would give a way to
cross-reference the collator version, which is the least human-readable
of the versions, against a more readable one (though only for views
built with the current runtime). It could also help debug oddities like
a cluster whose nodes are out of sync on libicu versions, or just make
it easier to check whether dbs are going to be rebuilt after an update.
Of course there are other ways for an admin to examine the current
runtime and work out the versions, so it is probably a question of how
frequently this will come up.
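
As a rough illustration of the cross-node check described above, here is
a minimal Python sketch. It assumes each node's version info has already
been fetched and parsed into a dict. The field names (collation_driver,
library_version, collator_version), the node names, and the version
numbers are placeholders for illustration, not a confirmed response
format, and the helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-node version info; field names and values are assumptions.
SAMPLE = {
    "node1@127.0.0.1": {
        "collation_driver": {"library_version": "70.1", "collator_version": "153.112"}
    },
    "node2@127.0.0.1": {
        "collation_driver": {"library_version": "70.1", "collator_version": "153.112"}
    },
    "node3@127.0.0.1": {
        "collation_driver": {"library_version": "67.1", "collator_version": "153.104"}
    },
}


def group_nodes_by_collator(versions_by_node):
    """Group node names by the (libicu, collator) version pair they report."""
    groups = defaultdict(list)
    for node, info in sorted(versions_by_node.items()):
        driver = info.get("collation_driver", {})
        key = (driver.get("library_version"), driver.get("collator_version"))
        groups[key].append(node)
    return dict(groups)


def nodes_out_of_sync(versions_by_node):
    """True when the nodes report more than one distinct version pair."""
    return len(group_nodes_by_collator(versions_by_node)) > 1


def stale_view_versions(view_collator_versions, runtime_collator_version):
    """Return the collator versions recorded in a view that differ from the runtime's."""
    return [v for v in view_collator_versions if v != runtime_collator_version]


if __name__ == "__main__":
    for versions, nodes in group_nodes_by_collator(SAMPLE).items():
        print(versions, nodes)
    print("out of sync:", nodes_out_of_sync(SAMPLE))
```

Whether a non-empty result from stale_view_versions actually leads to a
rebuild depends on the logic in the PR; the sketch only shows the
comparison an admin could make by hand.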

Thanks,
Will

On Tue, Jan 25, 2022 at 7:45 AM Nick Vatamaniuc <vatam...@gmail.com> wrote:
>
> Another update regarding the draft PR.
>
> There are now upgrade tests to see how we handle older 2.x, 3.2.1, and
> views with multiple collator versions in them.
>
> The last commit modifies the _design/*/_info API to return the list of
> collator versions to the user. I wanted to see what everyone thought
> about it:
>
> https://github.com/apache/couchdb/pull/3889#issuecomment-1020861208
>
> Thanks,
> -Nick
>
> On Tue, Jan 11, 2022 at 1:06 PM Nick Vatamaniuc <vatam...@gmail.com> wrote:
> >
> > I threw together a draft PR https://github.com/apache/couchdb/pull/3889
> >
> > Would that work? There are two tricks there. The first is re-using a
> > field position from an older <2.3.1 format, which should allow
> > transparently downgrading back to 3.2.1, as we ignore that field there.
> > The second is using a map, so it should allow adding extra info to the
> > views in the future (custom collation tailorings?).
> >
> > Thanks,
> > -Nick
> >
> > On Sat, Nov 20, 2021 at 12:32 PM Nick Vatamaniuc <vatam...@gmail.com> wrote:
> > >
> > > Thanks, Adam. And thanks for the tip about the view header, Bob.
> > >
> > > Wonder if a disk version would make sense for views. Noticed Eric did
> > > a nice job transparently migrating 2.x -> 3.x view files when we
> > > removed key seq indices. Perhaps something like that would work for
> > > adding a collator version.
> > >
> > > Cheers,
> > > -Nick
> > >
> > > > On Fri, Nov 19, 2021 at 9:09 AM Adam Kocoloski <kocol...@apache.org> wrote:
> > > >
> > > > That seems like a smart solution Nick.
> > > >
> > > > Adam
> > > >
> > > > > On Nov 19, 2021, at 7:28 AM, Robert Newson <b...@rsn.io> wrote:
> > > > >
> > > > > > Noting that the upgrade channel for views was misconceived (by me) as
> > > > > > there is no version number in the header for them. You’d need to add it.
> > > > >
> > > > > B.
> > > > >
> > > > >> On 18 Nov 2021, at 07:12, Nick Vatamaniuc <vatam...@gmail.com> wrote:
> > > > >>
> > > > > >> Thinking more about this issue I wonder if we can avoid resetting and
> > > > >> rebuilding everything from scratch, and instead, let the upgrade
> > > > >> happen in the background, while still serving the existing view data.
> > > > >>
> > > > > >> The realization was that collation doesn't affect the emitted keys and
> > > > >> values themselves, only their order in the view b-trees. That means
> > > > >> we'd just have to rebuild b-trees, and that is exactly what our view
> > > > >> compactor already does.
> > > > >>
> > > > >> When we detect a libicu version discrepancy we'd submit the view for
> > > > >> compaction. We even have a dedicated "upgrade" [1] channel in smoosh
> > > > > >> which handles file version format upgrades, but we'll tweak that logic
> > > > >> to trigger on libicu version mismatches as well.
> > > > >>
> > > > >> Would this work? Does anyone see any issue with that approach?
> > > > >>
> > > > >> [1] 
> > > > >> https://github.com/apache/couchdb/blob/3.x/src/smoosh/src/smoosh_server.erl#L435-L442
> > > > >>
> > > > >> Cheers,
> > > > >> -Nick
> > > > >>
> > > > >>
> > > > >>
> > > > > >>> On Fri, Oct 29, 2021 at 7:01 PM Nick Vatamaniuc <vatam...@apache.org> wrote:
> > > > >>>
> > > > >>> Hello everyone,
> > > > >>>
> > > > >>> CouchDB by default uses the libicu library to sort its view rows.
> > > > >>> When views are built, we do not record or track the version of the
> > > > > >>> collation algorithm. The issue is that the ICU library may modify the
> > > > > >>> collation order between major libicu versions, and when that happens,
> > > > > >>> views built with the older versions may experience data loss. I wanted
> > > > > >>> to discuss the option to record the libicu collator version in each
> > > > > >>> view and then warn the user when there is a mismatch, with options to
> > > > > >>> ignore the mismatch or automatically rebuild the views.
> > > > >>>
> > > > > >>> Imagine, for example, searching patient records using start/end keys.
> > > > > >>> It could be possible that, say, the first letter of a patient's name
> > > > > >>> now collates differently in a new libicu. That would prevent the
> > > > > >>> patient record from showing up in the view results for some important
> > > > > >>> procedure or medication. Users might not even be aware of this kind of
> > > > > >>> data loss occurring, since there won't be any error in the API or
> > > > > >>> warning in the logs.
> > > > >>>
> > > > > >>> I was thinking about how to solve this. There were a few commits
> > > > > >>> already to clean up our collation drivers [1], expose libicu and
> > > > > >>> collation algorithm version in the new _versions endpoint [2], and
> > > > > >>> some other minor fixes in that area. As the next steps we could:
> > > > >>>
> > > > >>> 1) Modify our views to keep track of the collation algorithm
> > > > >>> version. We could attempt to transparently upgrade the view header
> > > > >>> format -- read the old view file, update the header with an extra
> > > > > >>> libicu collation version field, which updates the signature, and then
> > > > > >>> save the file with the new header and new signature. This avoids view
> > > > > >>> rebuilds; it just records the collator version in the view and moves
> > > > > >>> the files to a new name.
> > > > >>>
> > > > >>> 2) Do what PostgreSQL does, and 2a) emit a warning with the view
> > > > >>> results when the current libicu version doesn't match the version in
> > > > > >>> the view [3]. That means altering the view results to add a "warning":
> > > > > >>> "..." field. Another alternative 2b) is to emit a warning in the
> > > > > >>> _design/$ddoc/_info only. Users would have to know, after an OS
> > > > > >>> version upgrade or restoring backups, to look at their
> > > > > >>> _design/$ddoc/_info for each db for each ddoc. Of course, there may be
> > > > > >>> users who used the "raw" collation option, or know they are using
> > > > > >>> just the plain ASCII character sets in their views. So we'd have a
> > > > > >>> configuration setting to ignore the warnings as well.
> > > > >>>
> > > > > >>> 3) Users who see the warning could then either rebuild the view
> > > > > >>> with the new collator library manually, or it could happen
> > > > > >>> automatically based on a configuration option, basically "when
> > > > > >>> collator versions are mismatched, invalidate and rebuild all the
> > > > > >>> views".
> > > > >>>
> > > > >>> 4) We'd have a way for the users to assert (POST a ddoc update) that
> > > > >>> they double-checked the new ICU version and are convinced that a
> > > > > >>> particular view would not experience data loss with the new collator.
> > > > > >>> That should make the warning go away, and the view would not be
> > > > > >>> rebuilt. This can't be just a naive "collator" option setting as both
> > > > > >>> per-view and per-design options are used when computing the view
> > > > > >>> signature, and any changes there would result in the view being
> > > > > >>> rebuilt. Perhaps we can add it to the design docs as a separate option
> > > > > >>> which is excluded from the signature hash, like the "autoupdate"
> > > > > >>> setting for the background index builder
> > > > > >>> ("collation_version_accept"?). PostgreSQL also offers this option with
> > > > > >>> the ALTER COLLATION ... REFRESH VERSION command [3].
> > > > >>> What do we think, is this a reasonable approach? Is there something
> > > > >>> easier / simpler we can do?
> > > > >>>
> > > > >>> Thanks!
> > > > >>> -Nick
> > > > >>>
> > > > >>> [1] 
> > > > >>> https://github.com/apache/couchdb/pull/3746/commits/28f26f52fe2e170d98658311dafa8198d96b8061
> > > > >>> [2] 
> > > > >>> https://github.com/apache/couchdb/commit/c1bb4e4856edd93255d75ebe158b4da38bbf3333
> > > > >>> [3] https://www.postgresql.org/docs/13/sql-altercollation.html
> > > > >
> > > >
