(sorry for the late reply to this thread)

On Thu 2019-02-21 15:11:48 -0400, David Bremner wrote:
> to be unique case-insensitively, so I decided to convert them to lower
> case on input. This turns out to be "fun", if we try to handle things
> other than ASCII. So one option is to just insist prefixes are ASCII.
>
> Otherwise we could insist they are UTF-8, ignoring the locale. The
> fullest generality (I think) is to first convert from the users locale
> to utf8, as in the attached sample program.

I don't think this discussion fully covers just how "fun" this
conversion is. Even if we assume UTF-8 in the database (which i think
we should), making something all lower-case is locale-dependent.

The classic example, iirc, is that in most UTF-8 locales, U+0049 LATIN
CAPITAL LETTER I downcases to U+0069 LATIN SMALL LETTER I, but in tr_TR
(Turkish), it downcases to U+0131 LATIN SMALL LETTER DOTLESS I. (and
upper-casing U+0069 LATIN SMALL LETTER I in tr_TR yields U+0130 LATIN
CAPITAL LETTER I WITH DOT ABOVE)

Similarly, if there's anything that the DB cares about collation for,
that also varies dramatically across UTF-8 locales. sigh.

I have no problem with asserting that all character strings in the
notmuch database are UTF-8. That's just the only sane thing to do in
2019.

But if we build any feature into notmuch that makes assumptions or
requirements about upper-casing, lower-casing, or collating strings,
and that feature interacts between the currently-running locale and
whatever locale was used to store data in the database in the past,
and those locales can differ, we may be inflicting some subtle pain on
users.

(note that i'm assuming in this discussion that we're *just* talking
about metadata -- notmuch configuration options, explicit xapian terms,
etc, but *not* the indexed text of the messages, which is an entirely
different kettle of fish)

I see two protective approaches for handling this simply yet being
clear about our concerns. Both methods introduce a clear dependency on
some UTF-8 locale, in the way that we also have clear dependencies on
GMime or Xapian.

a) assert that all text strings in the notmuch db's metadata are
C.UTF-8, and enforce this explicitly in the codebase.

or,

b) upon database initialization, select a UTF-8 locale (probably based
on the user's locale during "notmuch setup") and store it in the
database (perhaps reporting and displaying it via a "notmuch config"
value).
If any locale-dependent function is used against in-database metadata
while a *different* locale is active in the environment, warn that this
mismatch is happening, and prefer the locale stored in the db.

I don't have the capacity to work on this kind of safeguard right now,
but someone who wants to learn more about locales and notmuch could try
to implement it and we could see what happens.

Being explicit about the concern like this might help to raise the
profile of the specific risky codepaths, which in turn could prompt
someone to make a more sophisticated and useful fix than either of the
guardrails described above.

        --dkg
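
For anyone picking this up: below is a minimal sketch of the
case-mapping mismatch described above. It is purely illustrative and
is not notmuch code. It assumes Python, whose str.lower() applies the
locale-independent Unicode default mappings, and it emulates tr_TR
lower-casing with a small hand-written table. The names lower_default,
lower_turkish and TURKISH_LOWER are hypothetical.

```python
# Illustrative only: emulates the Turkish dotted/dotless I rule with a
# hand-written table instead of consulting a real tr_TR locale.

# Hypothetical table: the two lower-casing rules that differ in tr_TR.
TURKISH_LOWER = {
    "\u0049": "\u0131",  # LATIN CAPITAL LETTER I -> LATIN SMALL LETTER DOTLESS I
    "\u0130": "\u0069",  # LATIN CAPITAL LETTER I WITH DOT ABOVE -> LATIN SMALL LETTER I
}


def lower_default(s: str) -> str:
    """Lower-case using Unicode default (locale-independent) mappings."""
    return s.lower()


def lower_turkish(s: str) -> str:
    """Lower-case the way a tr_TR locale would treat the letter I."""
    # Apply the Turkish-specific rules first, then the default mapping
    # for every other character.
    return "".join(TURKISH_LOWER.get(ch, ch) for ch in s).lower()


if __name__ == "__main__":
    prefix = "INBOX"
    # The same stored prefix yields two different keys depending on
    # which rule set was active when it was lower-cased.
    print(lower_default(prefix) == lower_turkish(prefix))
```

A prefix lower-cased under one rule set will not compare equal to the
same prefix lower-cased under the other, which is the kind of subtle
mismatch that approaches (a) and (b) are meant to prevent or at least
make visible.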
_______________________________________________ notmuch mailing list notmuch@notmuchmail.org https://notmuchmail.org/mailman/listinfo/notmuch