Yeah, piggybacking on what Weston said: the question is where we want to
draw the line: code points, combining character sequences, or graphemes [1].
IME, most people would want/assume that combining characters stay combined
in reversals (using Weston's example: "tréma" becoming "amért" rather than
"aḿert"). Note that this specific character "é" has both a combining form,
e + U+0301, and a single precomposed code point, U+00E9, but for many
diacritics from different writing systems there is only the combining
version.
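
To make the difference concrete, here is a rough Python sketch using only
the standard library. The helper names are made up for illustration and are
not anything in the kernel; the second function only approximates
"combining character sequences" by attaching any code point in a Unicode
mark category (Mn/Mc/Me) to the preceding one.

```python
import unicodedata


def reverse_code_points(s: str) -> str:
    """Reverse code point by code point (what a codepoint-based kernel does)."""
    return s[::-1]


def reverse_combining_sequences(s: str) -> str:
    """Reverse while keeping each base character together with its marks.

    Approximation: a code point whose general category starts with "M"
    is glued to the code point before it.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # attach the mark to the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return "".join(reversed(clusters))


# "tréma" spelled with the combining acute accent (e + U+0301)
word = "tre\u0301ma"
print(reverse_code_points(word))          # the accent lands on the "m"
print(reverse_combining_sequences(word))  # the accent stays on the "e"
```

This does not handle flags or emoji ZWJ sequences, since those code points
are not marks; that would need full grapheme cluster segmentation as
described in UAX #29.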

But whatever division we choose, documentation + links to explanations are
great.

[1]
https://mathias.gaunard.com/unicode/doc/html/unicode/introduction_to_unicode.html#unicode.introduction_to_unicode.notion_of_character
there's also discussion at https://unicode.org/reports/tr29/, though the
first link I found much clearer.

On Mon, May 17, 2021 at 10:46 AM Weston Pace <weston.p...@gmail.com> wrote:

> FWIW, combining marks were not actually added to support emojis.  Emojis
> are just one of the more popular uses of the feature.  Combining marks is a
> standard Unicode feature necessary to represent single “characters” in some
> complex situations (e.g. when it is necessary to distinguish between tréma
> and umlaut, or to represent certain characters in Navajo).
>
> That being said I agree with the conclusions.  It’s ok to leave out for now
> and no need to link to any docs.
>
> On Mon, May 17, 2021 at 5:31 AM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > I'm fine with pointing out that the function operates on codepoints.
> >
> > Linking to the Unicode documentation for emojis sounds entirely like a
> > distraction, though.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 17/05/2021 à 17:28, Ian Cook a écrit :
> > > +1 for clarifying this in the kernel documentation, referring to these
> > > multi-emoji glyphs as "emoji ZWJ sequences," and linking to
> > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> > >
> > > Ian
> > >
> > >
> > > On Mon, May 17, 2021 at 11:21 AM Antoine Pitrou <anto...@python.org> wrote:
> > >>
> > >>
> > >> Le 17/05/2021 à 17:17, David Li a écrit :
> > >>> A little clarification on my point: it's not that a single codepoint
> > >>> gets encoded with more than four bytes, it's that a grapheme
> > >>> cluster/human-delimited 'character' might be multiple codepoints, so
> > >>> reversing the individual codepoints may produce an unexpected
> > >>> result. For instance a flag emoji is actually two codepoints (two
> > >>> special 'letter' codepoints that represent the country code), so
> > >>> reversing a US flag naively will give you an odd '[SU]' instead.
> > >>
> > >> This sounds like saying that reversing a valid French word does not
> > >> produce a valid French word (well, in most cases). The kernel
> > >> documentation can't contain an entire tutorial about Unicode characters
> > >> and what to expect from them, IMHO.
> > >>
> > >> Regards
> > >>
> > >> Antoine.
> >
>
