Interesting idea, Will, about possibly using the collation functions
from the query server side. There is no current mechanism to do it;
we'd have to invent it.

If we could reliably detect the libicu version that couchjs links
against, we could try to track it separately from the libicu used to
sort view keys. But I don't think it's exposed as a JS library call
(like we have for get_libicu_version in the NIF module). And if we did
track it and there was a version mismatch, we wouldn't even be able to
use the recompaction trick from earlier in the thread; we'd have to
fully reset the view.
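
To make the recompact vs. reset distinction concrete, here is a rough
sketch of the decision. This is not CouchDB code: ViewVersions,
plan_upgrade and the couchjs_icu field are names made up for
illustration, and it assumes we could somehow learn the couchjs-side
libicu version, which we currently can't.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ViewVersions:
    """Versions recorded in a (hypothetical) view header."""
    collator_icu: str                  # libicu used to sort view keys
    couchjs_icu: Optional[str] = None  # libicu linked into couchjs, if known


def plan_upgrade(stored: ViewVersions, running: ViewVersions) -> str:
    """Return "none", "recompact" or "reset" for one view file."""
    # A different libicu inside couchjs may have changed the emitted keys
    # and values themselves, so re-sorting the b-tree is not enough.
    if (stored.couchjs_icu is not None
            and running.couchjs_icu is not None
            and stored.couchjs_icu != running.couchjs_icu):
        return "reset"
    # Only the sort order is stale: compaction rebuilds the b-tree while
    # the existing view data keeps being served.
    if stored.collator_icu != running.collator_icu:
        return "recompact"
    return "none"
```

In this sketch a view whose header has no couchjs-side version is
never reset, only recompacted, since there is nothing to compare
against.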

I noticed in your link that there is a build option,
'--without-intl-api', which disables libicu linking and turns off some
APIs on the JS side. One way to avoid having to track the libicu
version on the JS side at all is to disable its usage there :-) At
first it seems rather unusual, but it could provide some stability
guarantee that views don't become invalid after couchjs is upgraded.
(There is, of course, the chance that other APIs users call would
generate different data on a newer JS engine, which would also
invalidate the old views. It wouldn't have to be libicu; it could be,
say, some math operations or other string processing functions.)

Perhaps we should just track couchjs versions and engine types in the
view file headers, like we're starting to do with libicu versions? I
feel like we might need that at some point, but it also feels like a
future effort, since we'd have to handle full view resets, warnings,
user assertions about their view / JS engine being compatible, etc.
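
As a very rough idea of what tracking that in the header could look
like, assuming the header carries a map of versions: the field names
("libicu", "couchjs", "js_engine") and both helper functions below
are hypothetical, not what the draft PR does. Older view files that
never recorded a field read back as "unknown" rather than as a
mismatch.

```python
from typing import Dict, List

# Hypothetical set of version fields kept in a view header map.
TRACKED = ("libicu", "couchjs", "js_engine")


def stored_versions(header: Dict) -> Dict[str, str]:
    """Read tracked versions, filling in "unknown" for missing fields."""
    versions = header.get("versions", {})
    return {name: versions.get(name, "unknown") for name in TRACKED}


def changed_fields(header: Dict, running: Dict[str, str]) -> List[str]:
    """List tracked fields whose recorded value differs from the running one."""
    stored = stored_versions(header)
    # "unknown" means the file predates tracking, so we can't call it changed.
    return [name for name in TRACKED
            if stored[name] not in ("unknown", running.get(name))]
```

A changed "libicu" alone could go to the compactor, while a changed
"couchjs" or "js_engine" would be where the reset / warning / user
assertion handling comes in.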

Cheers,
-Nick

On Thu, Jan 13, 2022 at 12:14 PM Will Young <lostnetwork...@gmail.com> wrote:
>
> I would be a little hesitant to rely on the version mozilla wraps up
> reliably working since they modify it a little and don't actually use
> its normal build system:
> https://firefox-source-docs.mozilla.org/intl/icu.html#internationalization-in-spidermonkey-and-gecko
>
> That icu->spidermonkey->couch creates the longest and most fragile
> dependency chain when doing a full windows build is frustrating
> especially since it's not really clear if anyone would be making use of
> the Intl, String.prototype.normalize() or similar functionality in
> spidermonkey. If C functions to access things from icu were being
> explicitly registered in the JS context from the query server code,
> that could break this dependency chain, and we would be able to disable
> Intl and know explicitly what to do when icu really should be used. That would
> also make it easier to replace spidermonkey with a more minimalist JS
> interpreter.
>
> -Will
>
> On Thu, Jan 13, 2022 at 17:38, Nick Vatamaniuc
> <vatam...@gmail.com> wrote:
> >
> > Hi Ronny,
> >
> > If it makes it easier to build on some platforms it could make sense.
> > Or find some way for both of them to point to a single libicu library.
> >
> > On some OSes (ex. Linux distros), dynamically linking to a system
> > libicu also makes sense because it's often the easiest way to get
> > security updates. libicu has had quite a number of high risk CVEs over
> > the years [1]
> >
> > [1] https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=icu
> >
> > -Nick
> >
> > On Wed, Jan 12, 2022 at 2:47 PM Ronny Berndt
> > <ro...@kioskkinder.com.invalid> wrote:
> > >
> > > Hi,
> > >
> > > > to prevent different versions of the ICU libs, why not use the
> > > > version shipped in the spidermonkey tree (ESR versions only), link
> > > > against it in the build process, and not rely on the system version?
> > > >
> > > > The Windows build of CouchDB isn’t available for the current
> > > > version, and the build process for that OS is stuck at the moment.
> > > > Maybe this is a broader discussion, and maybe it is a good idea to
> > > > combine it with the Erlang version update process ([DISCUSS] Erlang
> > > > version update process for convenience binaries).
> > >
> > > - Ronny
> > >
> > > > > On Jan 12, 2022, at 16:31, Will Young <lostnetwork...@gmail.com> wrote:
> > > >
> > > > Hi Nick,
> > > >
> > > >  I like the way this breaks down the problem into something that can
> > > > work with the existing maintenance mechanisms. On the UCA version it
> > > > looks to me like the major version tracks the last unicode version
> > > > that had a collation change (version 9.0?), while the ICU version is
> > > > changing with each release which would be more frequent than actual
> > > > collation changes. Looking at the ICU release notes I get the
> > > > > impression that the frequency of change may be in between because of bug
> > > > fixes or additions to unicode that directly get a differing order in
> > > > the root collation. I.e. ICU 54 seems like a clean match of UCA
> > > > version and collation change while it seems like 59 could have changed
> > > > some emoji sort orders that may already have been reflected in 58's
> > > > UCA version?
> > > >
> > > > Another question I have about ICU synchronization is spidermonkey's
> > > > use of ICU. Since all build instructions keep erlang and mozjs'
> > > > linking to the same system ICU, I think there could never be a need to
> > > > record an ICU related version from the query server, but I've never
> > > > > seen instructions to set locales in relation to the query server or do
> > > > anything to ensure a function is using the root collator, so I don't
> > > > think the build setup reflects an actual need for spidermonkey to be
> > > > truly in sync on aspects of icu like collation setup and everything
> > > > important is happening in the erlang/nifs?
> > > > Thanks,
> > > > -Will
> > > >
> > > > > On Tue, Jan 11, 2022 at 19:06, Nick Vatamaniuc
> > > > > <vatam...@gmail.com> wrote:
> > > >>
> > > >> I threw together a draft PR https://github.com/apache/couchdb/pull/3889
> > > >>
> > > > >> Would that work? There are two tricks there. The first is re-using a
> > > > >> field position from an older <2.3.1 format, which should allow
> > > > >> transparently downgrading back to 3.2.1, as we ignore that field
> > > > >> there. The second is using a map, so it should allow adding extra
> > > > >> info to the views in the future (custom collation tailorings?).
> > > >>
> > > >> Thanks,
> > > >> -Nick
> > > >>
> > > >> On Sat, Nov 20, 2021 at 12:32 PM Nick Vatamaniuc <vatam...@gmail.com> 
> > > >> wrote:
> > > >>>
> > > >>> Thanks, Adam. And thanks for the tip about the view header, Bob.
> > > >>>
> > > >>> Wonder if a disk version would make sense for views. Noticed Eric did
> > > >>> a nice job transparently migrating 2.x -> 3.x view files when we
> > > >>> removed key seq indices. Perhaps something like that would work for
> > > >>> adding a collator version.
> > > >>>
> > > >>> Cheers,
> > > >>> -Nick
> > > >>>
> > > >>> On Fri, Nov 19, 2021 at 9:09 AM Adam Kocoloski <kocol...@apache.org> 
> > > >>> wrote:
> > > >>>>
> > > >>>> That seems like a smart solution Nick.
> > > >>>>
> > > >>>> Adam
> > > >>>>
> > > >>>>> On Nov 19, 2021, at 7:28 AM, Robert Newson <b...@rsn.io> wrote:
> > > >>>>>
> > > >>>>> Noting that the upgrade channel for views was misconceived (by me) 
> > > >>>>> as there is no version number in the header for them. You’d need to 
> > > >>>>> add it.
> > > >>>>>
> > > >>>>> B.
> > > >>>>>
> > > >>>>>> On 18 Nov 2021, at 07:12, Nick Vatamaniuc <vatam...@gmail.com> 
> > > >>>>>> wrote:
> > > >>>>>>
> > > > >>>>>> Thinking more about this issue I wonder if we can avoid resetting
> > > > >>>>>> and rebuilding everything from scratch, and instead, let the
> > > > >>>>>> upgrade happen in the background, while still serving the
> > > > >>>>>> existing view data.
> > > > >>>>>>
> > > > >>>>>> The realization was that collation doesn't affect the emitted
> > > > >>>>>> keys and values themselves, only their order in the view
> > > > >>>>>> b-trees. That means we'd just have to rebuild b-trees, and that
> > > > >>>>>> is exactly what our view compactor already does.
> > > > >>>>>>
> > > > >>>>>> When we detect a libicu version discrepancy we'd submit the view
> > > > >>>>>> for compaction. We even have a dedicated "upgrade" [1] channel
> > > > >>>>>> in smoosh which handles file version format upgrades, but we'll
> > > > >>>>>> tweak that logic to trigger on libicu version mismatches as
> > > > >>>>>> well.
> > > >>>>>>
> > > >>>>>> Would this work? Does anyone see any issue with that approach?
> > > >>>>>>
> > > >>>>>> [1] 
> > > >>>>>> https://github.com/apache/couchdb/blob/3.x/src/smoosh/src/smoosh_server.erl#L435-L442
> > > >>>>>>
> > > >>>>>> Cheers,
> > > >>>>>> -Nick
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>> On Fri, Oct 29, 2021 at 7:01 PM Nick Vatamaniuc 
> > > >>>>>>> <vatam...@apache.org> wrote:
> > > >>>>>>>
> > > >>>>>>> Hello everyone,
> > > >>>>>>>
> > > > >>>>>>> CouchDB by default uses the libicu library to sort its view
> > > > >>>>>>> rows. When views are built, we do not record or track the
> > > > >>>>>>> version of the collation algorithm. The issue is that the ICU
> > > > >>>>>>> library may modify the collation order between major libicu
> > > > >>>>>>> versions, and when that happens, views built with the older
> > > > >>>>>>> versions may experience data loss. I wanted to discuss the
> > > > >>>>>>> option to record the libicu collator version in each view then
> > > > >>>>>>> warn the user when there is a mismatch. Also, optionally ignore
> > > > >>>>>>> the mismatch, or automatically rebuild the views.
> > > >>>>>>>
> > > > >>>>>>> Imagine, for example, searching patient records using start/end
> > > > >>>>>>> keys. It could be possible that, say, the first letter of their
> > > > >>>>>>> name now collates differently in a new libicu. That would
> > > > >>>>>>> prevent the patient record from showing up in the view results
> > > > >>>>>>> for some important procedure or medication. Users might not
> > > > >>>>>>> even be aware of this kind of data loss occurring, there won't
> > > > >>>>>>> be any error in the API or warning in the logs.
> > > >>>>>>>
> > > > >>>>>>> I was thinking how to solve this. There were a few commits
> > > > >>>>>>> already to clean up our collation drivers [1], expose libicu and
> > > > >>>>>>> collation algorithm version in the new _versions endpoint [2],
> > > > >>>>>>> and some other minor fixes in that area. As the next steps we
> > > > >>>>>>> could:
> > > >>>>>>>
> > > > >>>>>>> 1) Modify our views to keep track of the collation algorithm
> > > > >>>>>>> version. We could attempt to transparently upgrade the view
> > > > >>>>>>> header format -- read the old view file, update the header with
> > > > >>>>>>> an extra libicu collation version field, that updates the
> > > > >>>>>>> signature, and then, save the file with the new header and new
> > > > >>>>>>> signature. This avoids view rebuilds, just records the collator
> > > > >>>>>>> version in the view and moves the files to a new name.
> > > >>>>>>>
> > > > >>>>>>> 2) Do what PostgreSQL does, and 2a) emit a warning with the view
> > > > >>>>>>> results when the current libicu version doesn't match the
> > > > >>>>>>> version in the view [3]. That means altering the view results to
> > > > >>>>>>> add a "warning": "..." field. Another alternative 2b) is emit a
> > > > >>>>>>> warning in the _design/$ddoc/_info only. Users would have to
> > > > >>>>>>> know that after an OS version upgrade, or restoring backups, to
> > > > >>>>>>> make sure to look at their _design/$ddoc/_info for each db for
> > > > >>>>>>> each ddoc. Of course, there may be users which used the "raw"
> > > > >>>>>>> collation option, or know they are using just the plain ASCII
> > > > >>>>>>> character sets in their views. So we'd have a configuration
> > > > >>>>>>> setting to ignore the warnings as well.
> > > >>>>>>>
> > > >>>>>>> 3) Users who see the warning, could then either rebuild the view
> > > >>>>>>> with the new collator library manually, or it could happen
> > > >>>>>>> automatically based on a configuration option, basically "when
> > > > >>>>>>> collator versions are mismatched, invalidate and rebuild all the
> > > >>>>>>> views".
> > > >>>>>>>
> > > > >>>>>>> 4) We'd have a way for the users to assert (POST a ddoc update)
> > > > >>>>>>> that they double-checked the new ICU version and are convinced
> > > > >>>>>>> that a particular view would not experience data loss with the
> > > > >>>>>>> new collator. That should make the warning go away, and the view
> > > > >>>>>>> to not be rebuilt. This can't be just a naive "collator" option
> > > > >>>>>>> setting as both per-view and per-design options are used when
> > > > >>>>>>> computing the view signature, and any changes there would result
> > > > >>>>>>> in the view being rebuilt. Perhaps we can add it to the design
> > > > >>>>>>> docs as a separate option which is excluded from the signature
> > > > >>>>>>> hash, like the "autoupdate" setting for background index builder
> > > > >>>>>>> ("collation_version_accept"?). PostgreSQL also offers this
> > > > >>>>>>> option with the ALTER COLLATION ... REFRESH VERSION command [3]
> > > >>>>>>>
> > > > >>>>>>> What do we think, is this a reasonable approach? Is there
> > > > >>>>>>> something easier / simpler we can do?
> > > >>>>>>>
> > > >>>>>>> Thanks!
> > > >>>>>>> -Nick
> > > >>>>>>>
> > > >>>>>>> [1] 
> > > >>>>>>> https://github.com/apache/couchdb/pull/3746/commits/28f26f52fe2e170d98658311dafa8198d96b8061
> > > >>>>>>> [2] 
> > > >>>>>>> https://github.com/apache/couchdb/commit/c1bb4e4856edd93255d75ebe158b4da38bbf3333
> > > >>>>>>> [3] https://www.postgresql.org/docs/13/sql-altercollation.html
> > > >>>>>
> > > >>>>
> > >
