On 2022-12-28 04:49, Marc Lehmann wrote:
On Wed, Dec 28, 2022 at 10:46:58AM +0330, Avesta Sabayemoghadam
<avestasabayemogha...@gmail.com> wrote:
character takes 2 bytes so normally لا is an array of two chars with the
size of 4 bytes. But لا has it's special Unicode value "U+FEFB" which takes
urxvt does not store characters in bytes, so this does not apply. urxvt
has no trouble storing that character, and the links you provided explain
that.
investigating the real issue is on our todo, but this is a complicated
problem, and at this point, urxvt does not support arabic
rendering/combining.
Isn't that like ligatures?
If we take the ligature "ﬁ" (U+FB01) for instance, it can be
de-normalized (decomposed by compatibility normalization) into its
individual components "f" and "i", but no normalization form composes
them back. The same is true for "ﻻ" (0xEF 0xBB 0xBB) which can be
de-normalized into the two characters: "ل" (0xD9 0x84) and "ا" (0xD8 0xA7).
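To double-check the byte values quoted above, here is a small sketch that
prints the UTF-8 encoding of each code point. The helper name utf8Hex is
just something I made up for illustration; TextEncoder is the standard
API available in browsers and Node.

```javascript
// Format the UTF-8 bytes of a string as "0xNN 0xNN ...".
// utf8Hex is a hypothetical helper name, not part of any library.
function utf8Hex(str) {
  const bytes = new TextEncoder().encode(str); // always encodes as UTF-8
  return Array.from(
    bytes,
    (b) => '0x' + b.toString(16).toUpperCase().padStart(2, '0')
  ).join(' ');
}

console.log(utf8Hex('\uFEFB')); // combined lam-alef: 0xEF 0xBB 0xBB
console.log(utf8Hex('\u0644')); // lam:               0xD9 0x84
console.log(utf8Hex('\u0627')); // alef:              0xD8 0xA7
```

So the combined form is one code point in 3 bytes, and the two separate
letters are two code points in 2 bytes each.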
It appears the difference here is that these two characters are always
shown in their combined form, as that is specific to Arabic script. I
suspect this is done by the font's ligatures, since they still behave as
two characters: you can always put the cursor in between and press
space, and then you get the individual characters...
If you print the combined form, it works:
$ printf '\xEF\xBB\xBB\n'
ﻻ
I've also tested this a bit in a JavaScript console in my browser as I'm
familiar with Unicode normalization and processing in JS...
> new TextEncoder().encode(new TextDecoder().decode(new
Uint8Array([0xEF, 0xBB, 0xBB])).normalize('NFC'))
Uint8Array(3) [239, 187, 187, buffer: ArrayBuffer(3), byteLength: 3,
byteOffset: 0, length: 3, Symbol(Symbol.toStringTag): 'Uint8Array']
> new TextEncoder().encode(new TextDecoder().decode(new
Uint8Array([0xEF, 0xBB, 0xBB])).normalize('NFKC'))
Uint8Array(4) [217, 132, 216, 167, buffer: ArrayBuffer(4), byteLength:
4, byteOffset: 0, length: 4, Symbol(Symbol.toStringTag): 'Uint8Array']
These decimal values are the same as the hex values above: the single
Unicode character (one code point encoded as 3 UTF-8 bytes) and the two
component characters (two code points encoded as 2 UTF-8 bytes each, so
4 bytes total). Note that you cannot go back to the 3-byte version after
doing the NFKC normalization... You can find more info about the
normalization forms at https://unicode.org/reports/tr15/.
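The "cannot go back" part can be checked the same way. This is only a
sketch: codePoints and recomposes are hypothetical helper names of mine,
while String.prototype.normalize and its four form names are standard.

```javascript
// List the code points of a string as "U+XXXX U+XXXX ...".
function codePoints(str) {
  return [...str]
    .map((c) => 'U+' + c.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'))
    .join(' ');
}

// Decompose U+FEFB with NFKC, then test whether the given normalization
// form brings the original single code point back.
function recomposes(form) {
  const ligature = '\uFEFB';
  const decomposed = ligature.normalize('NFKC');
  return decomposed.normalize(form) === ligature;
}

console.log(codePoints('\uFEFB'.normalize('NFKC'))); // U+0644 U+0627
for (const form of ['NFC', 'NFD', 'NFKC', 'NFKD']) {
  console.log(form, recomposes(form)); // false for every form
}
```

None of the four forms maps the two letters back to the combined code
point, so the decomposition is one-way as far as normalization goes.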
If you're working with de-normalized text it should be fairly simple to
write a filter that combines these two, but I presume there are a lot
more ligatures in Arabic that would have to be handled.
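A minimal sketch of such a filter, to show what I mean. The table and
the function name combineLigatures are hypothetical, and it only knows
the one pair discussed here. It also always emits the isolated form
U+FEFB, so it ignores joining context (after a joining letter the final
form U+FEFC would be the right one), which is part of why I think a
complete solution is more work than it looks.

```javascript
// Hypothetical lookup table: decomposed sequence -> combined code point.
// Only the pair from this thread is listed; more would have to be added.
const LIGATURES = new Map([
  ['\u0644\u0627', '\uFEFB'], // lam + alef -> isolated lam-alef form
]);

// Replace every known decomposed sequence with its combined form.
// Assumption: the input is already decomposed text; joining context
// (isolated vs. final form) is deliberately not handled.
function combineLigatures(text) {
  let out = text;
  for (const [sequence, combined] of LIGATURES) {
    out = out.split(sequence).join(combined);
  }
  return out;
}

const combined = combineLigatures('\u0644\u0627');
console.log(combined === '\uFEFB');                      // true
console.log(new TextEncoder().encode(combined).length); // 3
```

Running NFKC over the output gives the two original letters again, so
nothing is lost by the filter, but the text you store is then different
from what was typed.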
So, I'm not sure if there's an easy fix for that, maybe allowing font
ligatures would suffice... In any case I think it should be done either
at the source (combining into the proper code point) or through font
ligatures/some other post-processing (I think this is better as you
retain both individual characters in the text). IMHO it's not a Unicode
problem as both individual and combined characters are printed correctly
alone...
FWIW I use FiraCode in urxvt and ligatures aren't shown; everywhere
else I use that font with working ligatures, I get the combined form.
As a last test I tried disabling ligatures in VS Code and it reverted to
the individual form, slightly overlapped even, which was worse, so I'm
even more convinced it's done by ligatures now...
Regards,
--
Thomas
_______________________________________________
rxvt-unicode mailing list
rxvt-unicode@lists.schmorp.de
http://lists.schmorp.de/mailman/listinfo/rxvt-unicode