Package: screen
Version: 4.9.0-4
Severity: normal
Tags: upstream

Hi,

I was trying to figure out why irssi sometimes garbles the display when
certain emoji are involved in the channel topic; after some debugging, it
seems the issue is with screen, not irssi. To reproduce, start up screen
and do (in a shell)

  echo '\0360\0237\0217\0263\0357\0270\0217\0342\0200\0215\0360\0237\0214\0210'

(That is UTF-8 for U+1F3F3 U+FE0F U+200D U+1F308.)

Outside of screen, I correctly get 🏳️‍🌈, RAINBOW FLAG. However, inside
screen, I get 🏳️, space, 🌈 (WHITE FLAG, space, RAINBOW). Curiously enough,
if I switch windows and get an irssi redraw, it actually gets drawn
correctly, but that's probably something more complex.

I've tried following the code, and this is my understanding of what's
going on: screen really wants one cell = one codepoint, yet it needs to
support combining characters. So internally, it seems to keep a sort of
cache of combining sequences, using the otherwise-reserved-for-surrogates
range U+D800..U+DFFF. This seems to happen in two steps:
utf8_handle_comb() (in encoding.c) adds more code points to a given
sequence, allocating code points and keeping a linked list. (There's some
confusion in that the “font” member of struct mchar is used to hold the
upper 8 bits of the resulting surrogate, but I think that's just some sort
of hack because the “image” member is 8-bit only? And there's something
about double-wide characters that I don't fully understand.) Then, at
display time, ToUtf8_comb() follows this linked list back to output the
entire sequence.

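To make that description concrete, here is a toy Python model of the cache as I understand it. This is only a sketch: the class and method names (CombCache, combine, expand) are made up for illustration, it is not screen's code, and it leaves out the “font”/“image” split, the double-width handling, and the reordering of entries. It only shows the idea of handing out U+D800.. handles for (previous, combining) pairs and walking the chain back at output time.

```python
class CombCache:
    """Toy model of a combining-sequence cache (hypothetical, not screen's code)."""

    BASE = 0xD800    # first handle, start of the surrogate range
    LIMIT = 0xE000   # one past the last usable handle (U+DFFF)

    def __init__(self):
        self.entries = []   # entries[i] = (previous code point or handle, added code point)
        self.lookup = {}    # (previous, added) -> handle

    def combine(self, prev, comb):
        """Return the handle for 'prev followed by comb', allocating one if needed."""
        key = (prev, comb)
        if key not in self.lookup:
            handle = self.BASE + len(self.entries)
            if handle >= self.LIMIT:
                raise OverflowError("no free handles left in U+D800..U+DFFF")
            self.entries.append(key)
            self.lookup[key] = handle
        return self.lookup[key]

    def expand(self, cp):
        """Follow the chain back from a handle to the full code point sequence."""
        out = []
        while self.BASE <= cp < self.LIMIT:
            cp, comb = self.entries[cp - self.BASE]
            out.append(comb)
        out.append(cp)
        out.reverse()
        return out


cache = CombCache()
h1 = cache.combine(0x1F3F3, 0xFE0F)   # first handle, 0xD800
h2 = cache.combine(h1, 0x200D)        # second handle, 0xD801
print([hex(cp) for cp in cache.expand(h2)])
```

With the first three code points of the test string, this hands out the same two handles (d800, then d801) that show up in the debug log, and expanding the second handle gives back U+1F3F3 U+FE0F U+200D.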
The screen debug log appears to go through this (I've kept only what I
think are relevant messages):

  read UNICODE 1f3f3
  read UNICODE fe0f
  combinig char 1f3f3 fe0f -> d800
  bring to front: 0
  GotoPos (1,1) -> (0,1)
  ---LGotoPos 1 1
  read UNICODE 200d
  combinig char d800 200d -> d801
  bring to front: 1
  bring to front: 0
  GotoPos (1,1) -> (0,1)
  ---LGotoPos 1 1
  read UNICODE 1f308
  ---LGotoPos 0 1

Seemingly, it understands that the first two codepoints are to be
combined, allocates U+D800 for that, and then continues reading. Then it
reads the third one, combines it with U+D800 to create U+D801, but then
mistakenly does _not_ combine U+1F308 with it. This is why we end up with
two different things on screen.

It really seems to me that screen simply doesn't understand Unicode
extended grapheme clusters, only “legacy grapheme clusters”; from UAX#29
(https://unicode.org/reports/tr29/):

  A legacy grapheme cluster is defined as a base (such as A or カ)
  followed by zero or more continuing characters. One way to think of
  this is as a sequence of characters that form a “stack”.

This seems to come from ansi.c line 705 (WriteString()):

  if (curr->w_encoding == UTF8 && c >= 0x0300 && utf8_iscomb(c))
    {
      […]
      utf8_handle_comb(c, &omc);

In other words, it thinks certain code points are inherently combining
and thus are to be attached to the previous character. However, that's
changed since Unicode 5.1.0, circa 2008; now all four of these together
should form an extended grapheme cluster. The rules for extended grapheme
clusters are locale-dependent, but I'd guess screen would do just fine
with the default grapheme clusters (certainly much better than today).

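A rough Python model of that “attach if the code point itself is combining” rule shows the same split. Note the assumptions: looks_combining() is my stand-in for utf8_iscomb(), and instead of screen's own table it uses the general categories Mn, Me and Cf from Python's unicodedata module, which is a guess that happens to cover U+FE0F and U+200D. legacy_cells() is likewise a made-up name.

```python
import unicodedata


def looks_combining(cp):
    # Hypothetical stand-in for utf8_iscomb(); the category test is an
    # assumption, not screen's actual table.
    return cp >= 0x0300 and unicodedata.category(chr(cp)) in ("Mn", "Me", "Cf")


def legacy_cells(code_points):
    """Group code points the legacy way: base, then any inherently combining ones."""
    cells = []
    for cp in code_points:
        if cells and looks_combining(cp):
            cells[-1].append(cp)   # attach to the previous cell
        else:
            cells.append([cp])     # anything else starts a new cell
    return cells


flag = [0x1F3F3, 0xFE0F, 0x200D, 0x1F308]
for cell in legacy_cells(flag):
    print(" ".join(f"U+{cp:04X}" for cp in cell))
```

This gives two cells, U+1F3F3 U+FE0F U+200D and then U+1F308 on its own, since nothing about U+1F308 in isolation says it should attach to what came before.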
The rules are actually pretty simple, if a tad verbose:

  https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules

In particular, I would believe what we need here is rule GB11, “Do not
break within emoji modifier sequences or emoji zwj sequences.”:

  \p{Extended_Pictographic} Extend* ZWJ × \p{Extended_Pictographic}

where “×” means “don't break here”, i.e., continue to run
utf8_handle_comb(). Specifically, the rule matches because:

  - U+1F3F3 WAVING WHITE FLAG is indeed Extended_Pictographic
    (according to https://unicode.org/Public/15.0.0/ucd/emoji/emoji-data.txt)
  - U+FE0F VARIATION SELECTOR-16 is Extend, because it is Grapheme_Extend
    (according to https://unicode.org/Public/15.0.0/ucd/DerivedCoreProperties.txt)
  - U+200D ZERO WIDTH JOINER is, well, ZWJ
  - U+1F308 RAINBOW is also Extended_Pictographic (same file)

Is it possible to retrofit these rules? This specific rule would seem to
hit a lot of modern emoji sequences (the Unicode Consortium seems to
prefer using such sequences instead of defining new code points where
possible); I would assume including the full table of grapheme clusters
would help e.g. Korean text, too, although I cannot read Korean and have
no idea whether it's actually a problem.

As a hack, it seems that anything that follows a ZWJ would be very likely
to keep being part of the same grapheme cluster, but I haven't tested this
out in practice.

FWIW, tmux seems to have no problems showing the flag, although I haven't
checked its implementation.

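For illustration, here is a small Python sketch of what tracking GB11 (plus the “don't break before Extend or ZWJ” rule, GB9) could look like as a state machine. It is not a full UAX#29 implementation and not a patch for screen. The assumptions: EXTENDED_PICTOGRAPHIC is a hand-picked two-entry set standing in for the real table in emoji-data.txt, is_extend() approximates Grapheme_Extend with the general categories Mn and Me, and split_clusters() is a made-up name.

```python
import unicodedata

ZWJ = 0x200D

# Assumption: tiny stand-in for the Extended_Pictographic table; a real
# implementation would generate this from emoji-data.txt.
EXTENDED_PICTOGRAPHIC = frozenset({0x1F3F3, 0x1F308})


def is_extend(cp):
    # Approximation of Grapheme_Extend via general category (assumption).
    return unicodedata.category(chr(cp)) in ("Mn", "Me")


def split_clusters(code_points):
    """Split into clusters using only GB9 (Extend/ZWJ) and GB11."""
    clusters = []
    in_pict = False    # current cluster so far matches: ExtPict Extend*
    after_zwj = False  # current cluster so far matches: ExtPict Extend* ZWJ
    for cp in code_points:
        if clusters and cp == ZWJ:
            clusters[-1].append(cp)                 # GB9: no break before ZWJ
            after_zwj, in_pict = in_pict, False
        elif clusters and is_extend(cp):
            clusters[-1].append(cp)                 # GB9: no break before Extend
            after_zwj = False
        elif clusters and after_zwj and cp in EXTENDED_PICTOGRAPHIC:
            clusters[-1].append(cp)                 # GB11: keep the ZWJ sequence together
            in_pict, after_zwj = True, False
        else:
            clusters.append([cp])                   # otherwise, break
            in_pict = cp in EXTENDED_PICTOGRAPHIC
            after_zwj = False
    return clusters


flag = [0x1F3F3, 0xFE0F, 0x200D, 0x1F308]
print(split_clusters(flag))
```

With these stand-in tables, the four code points of the flag come out as a single cluster, while something like "a", ZWJ, U+1F308 still breaks before the rainbow because the cluster did not start with a pictographic character. That is also where this differs from the “anything after a ZWJ continues the cluster” hack, which would join both.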
-- Package-specific info:
File Existence and Permissions
------------------------------

drwxr-xr-x 34 root root   1140 Jun 26 17:15 /run
lrwxrwxrwx  1 root root      4 May  5  2013 /var/run -> /run
-rwxr-xr-x  1 root root 482392 Jan  9 04:56 /usr/bin/screen
-rw-r--r--  1 root root     29 Dec 20  2018 /etc/tmpfiles.d/screen-cleanup.conf
lrwxrwxrwx  1 root root      9 May 21  2017 /lib/systemd/system/screen-cleanup.service -> /dev/null
-rwxr-xr-x  1 root root   1222 Apr  3  2017 /etc/init.d/screen-cleanup
lrwxrwxrwx  1 root root     24 May 21  2017 /etc/rcS.d/S19screen-cleanup -> ../init.d/screen-cleanup

File contents
-------------

### /etc/tmpfiles.d/screen-cleanup.conf
______________________________________________________________________
d /run/screen 1777 root utmp
______________________________________________________________________

-- System Information:
Debian Release: 12.0
  APT prefers stable-security
  APT policy: (500, 'stable-security'), (500, 'stable-debug'), (500, 'proposed-updates'), (500, 'oldstable-security'), (500, 'oldstable-proposed-updates'), (500, 'stable'), (500, 'oldstable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 6.3.9 (SMP w/56 CPU threads; PREEMPT)
Locale: LANG=en_DK.UTF-8, LC_CTYPE=en_DK.UTF-8 (charmap=UTF-8), LANGUAGE=en_NO:en_US:en_GB:en
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages screen depends on:
ii  debianutils    5.7-0.4
ii  libc6          2.36-9
ii  libcrypt1      1:4.4.33-2
ii  libpam0g       1.5.2-6
ii  libtinfo6      6.4-4
ii  libutempter0   1.2.1-3

screen recommends no packages.

Versions of packages screen suggests:
pn  byobu | screenie | iselect  <none>
ii  ncurses-term                6.4-4

-- debconf-show failed