On 5/5/23 8:25 PM, Jeff Davis wrote:
On Fri, 2023-04-21 at 20:12 -0400, Robert Haas wrote:
On Fri, Apr 21, 2023 at 5:56 PM Jeff Davis <pg...@j-davis.com> wrote:
Most of the complaints seem to be complaints about v15 as well, and
while those complaints may be a reason to not make ICU the default,
they are also an argument that we should continue to learn and try to
fix those issues because they exist in an already-released version.
Leaving it the default for now will help us fix those issues rather
than hide them.

It's still early, so we have plenty of time to revert the initdb
default if we need to.

That's fair enough, but I really think it's important that some energy
get invested in providing adequate documentation for this stuff. Just
patching the code is not enough.

Attached a significant documentation patch.


I tried to make it comprehensive without trying to be exhaustive, and I
separated the explanation of language tags from what collation settings
you can include in a language tag, so hopefully that's more clear.

I added quite a few examples spread throughout the various sections,
and I preserved the existing examples at the end. I also left all of
the external links at the bottom for those interested enough to go
beyond what's there.

[Personal hat, not RMT]

Thanks -- this is super helpful. A bunch of these examples I had previously had to figure out by randomly searching blog posts / trial-and-error, so I think this will help developers get started more quickly.

Comments (and a lot are just little nits to tighten the language)

Commit message -- typo: "documentaiton"


+ If you see such a message, ensure that the <symbol>PROVIDER</symbol> and
+     <symbol>LOCALE</symbol> are as you expect, and consider specifying
+     directly as the canonical language tag instead of relying on the
+     transformation.
+    </para>

I'd recommend making this more prescriptive:

"If you see this notice, ensure that the <symbol>PROVIDER</symbol> and <symbol>LOCALE</symbol> are the expected result. For consistent results when using the ICU provider, specify the canonical <link linkend="icu-language-tag">language tag</link> instead of relying on the transformation."

+ If there is some problem interpreting the locale name, or if it represents
+ a language or region that ICU does not recognize, a message will be reported:

This is in the passive voice; consider:

"If there is a problem interpreting the locale name, or if the locale name represents a language or region that ICU does not recognize, you'll see the following error:"


+   <sect3 id="icu-language-tag">
+    <title>Language Tag</title>
+    <para>

Before jumping in, I'd recommend a quick definition of what a language tag is, e.g.:

"A language tag, defined in BCP 47, is a standardized identifier used to identify languages in computer systems" or something similar.

(I did find a database that made it simpler to search for these, which is one issue I've previously had, but I don't think we'd want to link to it)

+     To include this additional collation information in a language tag,
+     append <literal>-u</literal>, followed by one or more

My first question was "what's special about '-u'", so maybe we say:

"To include this additional collation information in a language tag, append <literal>-u</literal>, which indicates there are additional collation settings, followed by one or more..."

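For context, "-u" is the BCP 47 singleton that introduces the Unicode locale extension, and the settings that follow it are key/value pairs. As a rough illustration of that structure, here is a minimal Python sketch. The function name split_language_tag is hypothetical, and this is not how ICU or PostgreSQL parse tags: it assumes well-formed input, two-character keys, at most one value subtag per key, and no other extensions after "-u".

```python
def split_language_tag(tag):
    """Split a tag such as 'de-DE-u-co-phonebk-ks-level2' into its base
    language tag and a dict of the settings that follow '-u'.

    Simplified sketch: assumes well-formed input, two-character keys,
    at most one value subtag per key, and no other extensions.
    """
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "u" not in lowered:
        # No Unicode locale extension present.
        return tag, {}

    idx = lowered.index("u")
    base = "-".join(subtags[:idx])

    settings = {}
    key = None
    for sub in lowered[idx + 1:]:
        if len(sub) == 2:
            # Two-character subtags are keys; a key with no value means "true".
            key = sub
            settings[key] = "true"
        elif key is not None:
            # Longer subtags are the value for the most recent key.
            settings[key] = sub
    return base, settings


base, settings = split_language_tag("de-DE-u-co-phonebk-ks-level2")
print(base, settings)
```
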
+ ICU locales are specified as a <link linkend="icu-language-tag">Language
+ Tag</link>, but can also accept most libc-style locale names (which will
+     be transformed into language tags if possible).
+    </para>

I'd recommend removing the parentheticals:

ICU locales are specified as a BCP 47 <link linkend="icu-language-tag">Language Tag</link>, but can also accept most libc-style locale names. If possible, libc-style locale names are transformed into language tags.
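
To give a feel for what "transformed" means in the simplest case, here is a toy Python sketch. It is only an illustration of the language_TERRITORY.encoding shape and is not the algorithm ICU uses; the function name libc_to_language_tag is hypothetical, and modifiers such as "@euro" or special names such as "C" and "POSIX" are deliberately not handled.

```python
def libc_to_language_tag(name):
    """Toy illustration only: turn 'en_US.UTF-8' into 'en-US'.

    Not ICU's real conversion; handles just language_TERRITORY[.encoding].
    """
    # Drop the encoding suffix, i.e. everything from the first '.'.
    base = name.split(".", 1)[0]
    # libc separates language and territory with '_'; language tags use '-'.
    return base.replace("_", "-")


print(libc_to_language_tag("en_US.UTF-8"))
```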

+      <title>ICU Collation Levels</title>

Nothing to add here other than to say I'm extremely appreciative of this section. Once upon a time I sank a lot of time into trying to figure out how all of these levels worked.

+          Sensitivity when determining equality, with
+          <literal>level1</literal> the least sensitive and
+          <literal>identic</literal> the most sensitive. See <xref
+          linkend="icu-collation-levels"/> for details.

This discusses equality sensitivity, but I'm not sure if I understand that term here. The ICU docs seem to call these "strengths"[1], maybe we use that term to be consistent with upstream?
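
For reference, the correspondence between the values in the patch and the upstream strength names can be written down as a small lookup. This is a sketch for discussion, not proposed doc text; the names KS_TO_STRENGTH and strength_for are hypothetical, and the comments on what each strength typically distinguishes are a simplification, since the details depend on the locale.

```python
# Values accepted after the "ks" key, mapped to the strength names
# used in the ICU collation documentation.
KS_TO_STRENGTH = {
    "level1": "primary",     # typically base letters only
    "level2": "secondary",   # typically also accents
    "level3": "tertiary",    # typically also case
    "level4": "quaternary",
    "identic": "identical",
}


def strength_for(ks_value):
    """Return the ICU strength name for a 'ks' value such as 'level2'."""
    return KS_TO_STRENGTH[ks_value.lower()]


print(strength_for("level2"))
```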

+          If set to <literal>upper</literal>, upper case sorts before lower
+          case. If set to <literal>lower</literal>, lower case sorts before
+          upper case. If set to <literal>false</literal>, it depends on the
+          locale.

Suggestion to tighten this up:

"If set to <literal>false</literal>, the sort depends on the rules of the locale."

+      Defaults may depend on locale. The above table is not meant to be
+      complete. See <xref linkend="icu-external-references"/> for additinal
+      options and details.

Typo: additinal => "additional"

I didn't add additional documentation for ICU rules. There are so many
options for collations that it's hard for me to think of realistic
examples to specify the rules directly, unless someone wants to invent
a new language. Perhaps useful if working with an interesting text file
format with special treatment for delimiters?

I asked the question about rules here:

https://www.postgresql.org/message-id/e861ac4fdae9f9f5ce2a938a37bcb5e083f0f489.camel%40cybertec.at

and got some limited response about addressing sort complaints. That
sounds reasonable, but a lot of that can also be handled just by
specifying the right collation settings. Someone who understands the
use case better could add some more documentation.

I'm not too sure about this one -- from my experience, users want predictability in sorts, but there are a variety of ways to get that experience.

Thanks,

Jonathan

[1] https://unicode-org.github.io/icu/userguide/collation/concepts.html
