Re: Reason for breaking display of لا

Thomas Guyot-Sionnest Wed, 28 Dec 2022 10:27:50 -0800

On 2022-12-28 04:49, Marc Lehmann wrote:

On Wed, Dec 28, 2022 at 10:46:58AM +0330, Avesta Sabayemoghadam 
<avestasabayemogha...@gmail.com> wrote:

character takes 2 bytes so normally لا is an array of two chars with the
size of 4 bytes. But لا has it's special Unicode value "U+FEFB" which takes

urxvt does not store characters in bytes, so this does not apply. urxvt
has no trouble storing that character, and the links you provided explain
that.


investigating the real issue is on our todo, but this is a complicated
problem, and at this point, urxvt does not support arabic
rendering/combining.


Isn't that like ligatures?

If we take the ligature "ﬁ" for instance, it can be de-normalized into
its individual components "f" and "i", but cannot be normalized back.
The same is true for "ﻻ" (0xEF 0xBB 0xBB) which can be de-normalized
into the two characters: "ل" (0xD9 0x84) and "ا" (0xD8 0xA7).
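
For anyone who wants to check those byte values, here is a minimal
sketch for a browser console or Node.js. `utf8Hex` is a helper name made
up for this example, not something from urxvt or any library:

```javascript
// Hypothetical helper: UTF-8 bytes of a string as hex, e.g. "EF BB BB".
function utf8Hex(s) {
  return Array.from(new TextEncoder().encode(s),
    b => b.toString(16).toUpperCase().padStart(2, '0')).join(' ');
}

console.log(utf8Hex('\uFEFB')); // combined lam-alef, a single code point
console.log(utf8Hex('\u0644')); // lam
console.log(utf8Hex('\u0627')); // alef
console.log(utf8Hex('\uFB01')); // Latin "fi" ligature, for comparison
```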

It appears the difference here is that these two characters are always
shown in their combined form as they're specific to Arabic script. I
suspect this is done by the font's ligatures, since they still behave as
two characters: you can always get the cursor in between and press
space, then you'll get the individual characters...


If you print the combined form, it works:

$ printf '\xEF\xBB\xBB\n'
ﻻ

I've also tested this a bit in a JavaScript console in my browser as I'm
familiar with Unicode normalization and processing in JS...

> new TextEncoder().encode(new TextDecoder().decode(new Uint8Array([0xEF, 0xBB, 0xBB])).normalize('NFC'))
Uint8Array(3) [239, 187, 187, buffer: ArrayBuffer(3), byteLength: 3, byteOffset: 0, length: 3, Symbol(Symbol.toStringTag): 'Uint8Array']

> new TextEncoder().encode(new TextDecoder().decode(new Uint8Array([0xEF, 0xBB, 0xBB])).normalize('NFKC'))
Uint8Array(4) [217, 132, 216, 167, buffer: ArrayBuffer(4), byteLength: 4, byteOffset: 0, length: 4, Symbol(Symbol.toStringTag): 'Uint8Array']

These decimal values are the same as the hex values above for the single
Unicode char (one code point, 3 bytes in UTF-8) and for the two separate
characters (two code points of 2 bytes each in UTF-8, so 4 bytes total).
Note that you cannot go back to the 3-byte version after doing the NFKC
normalization... You can find more info about the normalization forms at
https://unicode.org/reports/tr15/.
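
To make the one-way nature of this explicit, here is a small sketch that
runs the round trip on code points instead of bytes. `codePoints` is
just a helper name chosen for this example:

```javascript
// Hypothetical helper: code points of a string as "U+XXXX" labels.
const codePoints = s =>
  Array.from(s, c =>
    'U+' + c.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));

const combined = '\uFEFB';                     // lam-alef presentation form
const decomposed = combined.normalize('NFKC'); // compatibility mapping applied
const roundTrip = decomposed.normalize('NFC'); // canonical composition only

console.log(codePoints(combined));    // U+FEFB
console.log(codePoints(decomposed));  // U+0644, U+0627
console.log(codePoints(roundTrip));   // U+0644, U+0627 (not recombined)
console.log(roundTrip === combined);  // false
```

NFC only undoes canonical decompositions, and the mapping from U+FEFB to
lam + alef is a compatibility one, which is why nothing recombines.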

If you're working with de-normalized text it should be fairly simple to
write a filter that combines these two, but I presume there are a lot
more ligatures in Arabic that would have to be handled.
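
A rough sketch of what such a filter could look like, under heavy
assumptions: it only knows the plain lam + alef pair, and it always
substitutes the isolated form U+FEFB, whereas real Arabic shaping would
pick a form based on the neighbouring letters (e.g. the final form
U+FEFC). `combineLamAlef` is a made-up name; nothing like it exists in
urxvt:

```javascript
// Assumption: only lam (U+0644) directly followed by alef (U+0627) is
// handled, and it is always replaced by the isolated form U+FEFB.
const LAM_ALEF = /\u0644\u0627/g;

// Hypothetical filter: returns the text with each lam + alef pair
// replaced by the single combined code point.
function combineLamAlef(text) {
  return text.replace(LAM_ALEF, '\uFEFB');
}

const input = 'x\u0644\u0627y';
const output = combineLamAlef(input);
console.log(input.length, output.length); // 4 3
```

The downside is the one mentioned above: once combined, the text no
longer contains the two individual characters.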

So, I'm not sure if there's an easy fix for that, maybe allowing font
ligatures would suffice... In any case I think it should be done either
at the source (combining into the proper code point) or through font
ligatures/some other post-processing (I think this is better as you
retain both individual characters in the text). IMHO it's not a Unicode
problem as both individual and combined characters are printed correctly
alone...

FWIW I use FiraCode in urxvt and ligatures aren't shown - everywhere
else I use that font where ligatures work I get the combined form. As a
last test I tried disabling ligatures in VS Code and it reverted to the
individual form, even slightly overlapped so that was even worse, so I'm
even more convinced it's done by ligatures now...


Regards,

--
Thomas

_______________________________________________
rxvt-unicode mailing list
rxvt-unicode@lists.schmorp.de
http://lists.schmorp.de/mailman/listinfo/rxvt-unicode
