Bug#1039503: wrongly splits up extended grapheme clusters (like certain emoji)

Steinar H. Gunderson Mon, 26 Jun 2023 11:27:16 -0700

Package: screen
Version: 4.9.0-4
Severity: normal
Tags: upstream

Hi,


I was trying to figure out why irssi sometimes garbles the display when certain
emoji are involved in the channel topic; after some debugging, it seems the
issue is with screen, not irssi. To reproduce, start up screen and do (in a
shell):

  echo '\0360\0237\0217\0263\0357\0270\0217\0342\0200\0215\0360\0237\0214\0210'

(That is UTF-8 for U+1F3F3 U+FE0F U+200D U+1F308.)
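
As a sanity check on that byte sequence, here is a minimal Python sketch (not part of screen or irssi) that decodes the same octal values and lists the resulting code points. The variable names are only illustrative.

```python
# The bytes produced by the echo command, written as the same octal values.
data = bytes([
    0o360, 0o237, 0o217, 0o263,   # U+1F3F3
    0o357, 0o270, 0o217,          # U+FE0F
    0o342, 0o200, 0o215,          # U+200D
    0o360, 0o237, 0o214, 0o210,   # U+1F308
])

# Decode as UTF-8 and list the code points in order.
code_points = [ord(ch) for ch in data.decode("utf-8")]
print(" ".join(f"U+{cp:04X}" for cp in code_points))
# prints: U+1F3F3 U+FE0F U+200D U+1F308
```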

Outside of screen, I correctly get 🏳️‍🌈 (RAINBOW FLAG). However, inside
screen, I get 🏳️, space, 🌈 (WHITE FLAG, space, RAINBOW). Curiously enough,
if I switch windows and get an irssi redraw, it actually gets drawn correctly,
but that's probably something more complex.

I've tried following the code, and this is my understanding of what's going on:
screen really wants one cell = one codepoint, yet it needs to support combining
characters. So internally, it seems to keep a sort of cache of combining
sequences, using the otherwise-reserved-for-surrogates range U+D800..U+DFFF.

This seems to happen in two steps; utf8_handle_comb() (in encoding.c) adds more
code points to a given sequence, allocating points and keeping a linked list.
(There's some confusion in that the “font” member of struct mchar is used to
hold the upper 8 bits of the resulting surrogate, but I think that's just some
sort of hack because the “image” member is 8-bit only? And there's something
about double-wide characters that I don't fully understand.) Then, at display
time, ToUtf8_comb() follows this linked list back to output the entire sequence.
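
To make that two-step scheme concrete, here is a toy Python model of it as I understand it. It is a sketch under assumptions, not screen's code: the class and method names are hypothetical (only loosely named after utf8_handle_comb() and ToUtf8_comb()), and it leaves out the "font" member, double-wide handling and any reuse of cache entries.

```python
COMB_BASE = 0xD800  # start of the surrogate range reused as cache indices


class CombCache:
    """Toy model: each entry records (previous cell value, appended code point)."""

    def __init__(self):
        self.entries = []

    def handle_comb(self, prev, c):
        # Allocate the next pseudo code point and link it back to `prev`.
        self.entries.append((prev, c))
        return COMB_BASE + len(self.entries) - 1

    def expand(self, cell):
        # Follow the links back to the base character, then emit in order.
        seq = []
        while COMB_BASE <= cell < COMB_BASE + len(self.entries):
            cell, c = self.entries[cell - COMB_BASE]
            seq.append(c)
        seq.append(cell)
        return seq[::-1]


cache = CombCache()
cell = cache.handle_comb(0x1F3F3, 0xFE0F)   # first entry, 0xD800
cell = cache.handle_comb(cell, 0x200D)      # second entry, 0xD801
print(hex(cell), [hex(cp) for cp in cache.expand(cell)])
# prints: 0xd801 ['0x1f3f3', '0xfe0f', '0x200d']
```

The cell only ever stores one value (a real code point or a cache index), which is how one cell = one codepoint can coexist with combining sequences.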

The screen debug log appears to go through this (I've kept only what I think
are relevant messages):

  read UNICODE 1f3f3
  read UNICODE fe0f
  combinig char 1f3f3 fe0f -> d800
  bring to front: 0
  GotoPos (1,1) -> (0,1)
  ---LGotoPos 1 1
  read UNICODE 200d
  combinig char d800 200d -> d801
  bring to front: 1
  bring to front: 0
  GotoPos (1,1) -> (0,1)
  ---LGotoPos 1 1
  read UNICODE 1f308
  ---LGotoPos 0 1

Seemingly, it understands that the first two codepoints are to be combined,
allocates U+D800 for that, and then continues reading. Then it reads
the third one, combines it with U+D800 to create U+D801, but then mistakenly
does _not_ combine U+1F308 with it. This is why we end up with two different
things on screen.

It really seems to me that screen simply doesn't understand Unicode extended
grapheme clusters, only “legacy grapheme clusters”; from UAX#29
(https://unicode.org/reports/tr29/):

  A legacy grapheme cluster is defined as a base (such as A or カ) followed by
  zero or more continuing characters. One way to think of this is as a sequence
  of characters that form a “stack”.

This seems to come from ansi.c line 705 (WriteString()):

  if (curr->w_encoding == UTF8 && c >= 0x0300 && utf8_iscomb(c))
   {
     […]
     utf8_handle_comb(c, &omc);

In other words, it thinks certain code points are inherently combining
and thus are to be attached to the previous character. However, that's
changed since Unicode 5.1.0, circa 2008; now all of these four together
should form an extended grapheme cluster.
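
The effect of that check can be reproduced with a small Python simulation. This is a sketch, not screen's logic verbatim: COMBINING is a hypothetical stand-in for utf8_iscomb() that covers only the two code points which the debug log above shows being combined.

```python
# Assumption: stand-in for utf8_iscomb(), covering only this report's code points.
COMBINING = {0xFE0F, 0x200D}


def legacy_cells(code_points):
    """Group code points into cells the 'inherently combining' way."""
    cells = []
    for cp in code_points:
        if cells and cp >= 0x0300 and cp in COMBINING:
            cells[-1].append(cp)      # attach to the previous cell
        else:
            cells.append([cp])        # everything else starts a new cell
    return cells


cells = legacy_cells([0x1F3F3, 0xFE0F, 0x200D, 0x1F308])
print([[hex(cp) for cp in cell] for cell in cells])
# prints: [['0x1f3f3', '0xfe0f', '0x200d'], ['0x1f308']]
```

U+1F308 is not a combining character, so it lands in a cell of its own, matching the two separate glyphs seen inside screen.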

The rules for extended grapheme clusters are locale-dependent, but I'd
guess screen would do just fine with the default grapheme clusters
(certainly much better than today). The rules are actually pretty simple,
if a tad verbose:

https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules

In particular, I would believe what we need here is rule GB11,
“Do not break within emoji modifier sequences or emoji zwj sequences.”:

  \p{Extended_Pictographic} Extend* ZWJ × \p{Extended_Pictographic}

where “×” means “don't break here”, i.e., continue to run
utf8_handle_comb(). Specifically, the rule matches because:

  U+1F3F3 WAVING WHITE FLAG is indeed Extended_Pictographic
          (according to
          https://unicode.org/Public/15.0.0/ucd/emoji/emoji-data.txt)
  U+FE0F  VARIATION SELECTOR-16 is Extend, because it is Grapheme_Extend
          (according to
          https://unicode.org/Public/15.0.0/ucd/DerivedCoreProperties.txt)
  U+200D  ZERO WIDTH JOINER is, well, ZWJ
  U+1F308 RAINBOW is also Extended_Pictographic (per emoji-data.txt again)
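
A minimal Python sketch of the GB11 check, applied to this sequence, could look like the following. The property sets are assumptions that contain only the code points listed above; a real implementation would need the full tables from the Unicode data files, and the function name is hypothetical.

```python
# Assumption: tiny property tables holding only the code points from this report.
EXTENDED_PICTOGRAPHIC = {0x1F3F3, 0x1F308}
EXTEND = {0xFE0F}
ZWJ = 0x200D


def gb11_no_break(before, nxt):
    """True if GB11 forbids a break between the cluster so far and nxt.

    Matches: Extended_Pictographic Extend* ZWJ x Extended_Pictographic
    """
    if nxt not in EXTENDED_PICTOGRAPHIC or not before or before[-1] != ZWJ:
        return False
    i = len(before) - 2
    while i >= 0 and before[i] in EXTEND:   # skip the Extend* run
        i -= 1
    return i >= 0 and before[i] in EXTENDED_PICTOGRAPHIC


print(gb11_no_break([0x1F3F3, 0xFE0F, 0x200D], 0x1F308))   # prints: True
print(gb11_no_break([0x41, 0x200D], 0x1F308))              # prints: False
```

The second call shows that a ZWJ alone is not enough: the rule also requires a pictographic character before the Extend* run.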

Is it possible to retrofit these rules? This specific rule would seem to
hit a lot of modern emoji sequences (the Unicode Consortium seems to prefer
using such sequences instead of defining new code points where possible);
I would assume including the full table of grapheme clusters would help
e.g. Korean text, too, although I cannot read Korean and have no idea
whether it's actually a problem.

As a hack, it seems that anything that follows a ZWJ would be very likely
to keep being part of the same grapheme cluster, but I haven't tested this
out in practice.
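
For illustration, the hack could be modelled like this in Python. It is an untested-in-screen sketch: COMBINING again stands in for utf8_iscomb() with only this report's code points, and the function name is hypothetical.

```python
# Assumption: stand-in for utf8_iscomb(), covering only this report's code points.
COMBINING = {0xFE0F, 0x200D}
ZWJ = 0x200D


def cells_with_zwj_hack(code_points):
    """Like the current grouping, but also glue whatever follows a ZWJ."""
    cells = []
    for cp in code_points:
        after_zwj = bool(cells) and cells[-1][-1] == ZWJ
        if cells and (cp in COMBINING or after_zwj):
            cells[-1].append(cp)      # stays in the same cell
        else:
            cells.append([cp])        # starts a new cell
    return cells


print(len(cells_with_zwj_hack([0x1F3F3, 0xFE0F, 0x200D, 0x1F308])))   # prints: 1
print(len(cells_with_zwj_hack([0x61, 0x200D, 0x62])))                 # prints: 1
```

The second call shows the cost of the shortcut: "a", ZWJ, "b" also ends up in one cell, whereas GB11 would allow a break before the "b" because neither letter is Extended_Pictographic.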

FWIW, tmux seems to have no problems showing the flag, although I haven't
checked its implementation.

-- Package-specific info:
File Existence and Permissions
------------------------------

drwxr-xr-x 34 root root   1140 Jun 26 17:15 /run
lrwxrwxrwx  1 root root      4 May  5  2013 /var/run -> /run
-rwxr-xr-x  1 root root 482392 Jan  9 04:56 /usr/bin/screen
-rw-r--r--  1 root root     29 Dec 20  2018 /etc/tmpfiles.d/screen-cleanup.conf
lrwxrwxrwx  1 root root      9 May 21  2017 /lib/systemd/system/screen-cleanup.service -> /dev/null
-rwxr-xr-x  1 root root   1222 Apr  3  2017 /etc/init.d/screen-cleanup
lrwxrwxrwx  1 root root     24 May 21  2017 /etc/rcS.d/S19screen-cleanup -> ../init.d/screen-cleanup

File contents
-------------

### /etc/tmpfiles.d/screen-cleanup.conf
______________________________________________________________________
d /run/screen 1777 root utmp
______________________________________________________________________

-- System Information:
Debian Release: 12.0
  APT prefers stable-security
  APT policy: (500, 'stable-security'), (500, 'stable-debug'), (500, 'proposed-updates'), (500, 'oldstable-security'), (500, 'oldstable-proposed-updates'), (500, 'stable'), (500, 'oldstable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 6.3.9 (SMP w/56 CPU threads; PREEMPT)
Locale: LANG=en_DK.UTF-8, LC_CTYPE=en_DK.UTF-8 (charmap=UTF-8), LANGUAGE=en_NO:en_US:en_GB:en
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages screen depends on:
ii  debianutils   5.7-0.4
ii  libc6         2.36-9
ii  libcrypt1     1:4.4.33-2
ii  libpam0g      1.5.2-6
ii  libtinfo6     6.4-4
ii  libutempter0  1.2.1-3

screen recommends no packages.

Versions of packages screen suggests:
pn  byobu | screenie | iselect  <none>
ii  ncurses-term                6.4-4

-- debconf-show failed
