On 2022-12-28 04:49, Marc Lehmann wrote:
On Wed, Dec 28, 2022 at 10:46:58AM +0330, Avesta Sabayemoghadam 
<avestasabayemogha...@gmail.com> wrote:
character takes 2 bytes so normally لا is an array of two chars with the
size of 4 bytes. But لا has its special Unicode value "U+FEFB" which takes
urxvt does not store characters in bytes, so this does not apply. urxvt
has no trouble storing that character, and the links you provided explain
that.

Investigating the real issue is on our todo list, but this is a complicated
problem, and at this point, urxvt does not support Arabic
rendering/combining.

Isn't that like ligatures?


If we take the ligature "fi" (U+FB01) for instance, compatibility normalization can decompose it into its individual components "f" and "i", but no normalization form composes them back. The same is true for "ﻻ" (U+FEFB, UTF-8 bytes 0xEF 0xBB 0xBB), which decomposes into the two characters "ل" (0xD9 0x84) and "ا" (0xD8 0xA7).
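
To make the one-way nature of this concrete, here is a minimal sketch using the standard String.prototype.normalize(). The helper name codePoints is mine, purely for illustration; it should run as-is in Node or a browser console.

```javascript
// Two ligature code points that only have *compatibility* decompositions.
const fiLigature = "\uFB01"; // LATIN SMALL LIGATURE FI
const lamAlef = "\uFEFB";    // ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

// Hypothetical helper: list the code points of a string as U+XXXX labels.
const codePoints = (s) =>
  Array.from(s, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));

// NFKD (compatibility decomposition) splits each ligature into its parts.
const fiParts = codePoints(fiLigature.normalize("NFKD"));
const lamAlefParts = codePoints(lamAlef.normalize("NFKD"));

// NFC (canonical composition) leaves the two parts alone; it does not
// rebuild the ligature code point.
const recomposed = codePoints("\u0644\u0627".normalize("NFC"));

console.log(fiParts, lamAlefParts, recomposed);
```

So the decomposition is available through NFKD/NFKC, but none of the four normalization forms brings the ligature code point back.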

It appears the difference here is that these two characters are always shown in their combined form, as that is specific to Arabic script. I suspect this is done by the font's ligatures, since they still show as two characters: you can always put the cursor in between and press space, and then you get the individual characters...

If you print the combined form, it works:

$ printf '\xEF\xBB\xBB\n'
ﻻ


I've also tested this a bit in a JavaScript console in my browser as I'm familiar with Unicode normalization and processing in JS...

> new TextEncoder().encode(new TextDecoder().decode(new Uint8Array([0xEF, 0xBB, 0xBB])).normalize('NFC'))
Uint8Array(3) [239, 187, 187, buffer: ArrayBuffer(3), byteLength: 3, byteOffset: 0, length: 3, Symbol(Symbol.toStringTag): 'Uint8Array']

> new TextEncoder().encode(new TextDecoder().decode(new Uint8Array([0xEF, 0xBB, 0xBB])).normalize('NFKC'))
Uint8Array(4) [217, 132, 216, 167, buffer: ArrayBuffer(4), byteLength: 4, byteOffset: 0, length: 4, Symbol(Symbol.toStringTag): 'Uint8Array']

These decimal values are the same as the hex values above: the single code point U+FEFB encodes as 3 bytes in UTF-8, and the two component characters encode as 2 bytes each, so 4 bytes in total. Note that you cannot go back to the 3-byte version after doing the NFKC normalization... You can find more info about the normalization forms at https://unicode.org/reports/tr15/.
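
For anyone who wants to check the byte counts without reading raw Uint8Array dumps, a small sketch along the same lines. The helpers utf8Hex and describe are hypothetical names of mine; TextEncoder is the standard API and always produces UTF-8.

```javascript
const encoder = new TextEncoder(); // always encodes to UTF-8

// Hypothetical helper: the UTF-8 bytes of a string as 0xNN labels.
const utf8Hex = (s) =>
  Array.from(encoder.encode(s), (b) =>
    "0x" + b.toString(16).toUpperCase().padStart(2, "0"));

// Hypothetical helper: code point count next to the encoded bytes.
const describe = (s) => ({
  codePoints: Array.from(s).length,
  utf8Bytes: utf8Hex(s),
});

const combined = describe("\uFEFB");       // the single ligature code point
const separate = describe("\u0644\u0627"); // lam followed by alef

console.log(combined, separate);
```

The byte count comes from the UTF-8 encoding of each code point, not from a fixed per-character width.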


If you're working with de-normalized text it should be fairly simple to write a filter that combines these two, but I presume there are a lot more ligatures in Arabic that would have to be handled.
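
A rough sketch of what such a filter could look like, covering only this one pair. The function name combineLamAlef is made up for illustration, and the sketch assumes the isolated form U+FEFB is always the right target, which is not true in general: when the lam joins to a preceding letter the final form U+FEFC would be needed, and the alef variants with madda or hamza have their own ligature code points (U+FEF5 to U+FEFA). It is not a substitute for real Arabic shaping.

```javascript
const LAM = "\u0644";
const ALEF = "\u0627";
const LAM_ALEF_ISOLATED = "\uFEFB"; // assumption: isolated form is wanted

// Hypothetical filter: replace every lam + alef pair with the single
// presentation-form code point. No contextual joining is considered.
function combineLamAlef(text) {
  return text.replace(new RegExp(LAM + ALEF, "g"), LAM_ALEF_ISOLATED);
}

// NFKC splits the ligature, the filter puts it back together.
const decomposed = LAM_ALEF_ISOLATED.normalize("NFKC");
const restored = combineLamAlef(decomposed);

console.log(restored === LAM_ALEF_ISOLATED);
```

The text piped to the terminal would go through combineLamAlef() first, at the cost of losing the two individual characters in the output.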

So, I'm not sure if there's an easy fix for that, maybe allowing font ligatures would suffice... In any case I think it should be done either at the source (combining into the proper code point) or through font ligatures/some other post-processing (I think this is better as you retain both individual characters in the text). IMHO it's not a Unicode problem as both individual and combined characters are printed correctly alone...

FWIW I use FiraCode in urxvt and ligatures aren't shown; everywhere else I use that font and ligatures work, I get the combined form. As a last test I tried disabling ligatures in VS Code and it reverted to the individual form, even slightly overlapped, which was even worse, so I'm now even more convinced it's done by ligatures...

Regards,

--
Thomas

_______________________________________________
rxvt-unicode mailing list
rxvt-unicode@lists.schmorp.de
http://lists.schmorp.de/mailman/listinfo/rxvt-unicode
