On 11/2/25 5:03 AM, Pete Resnick wrote:
On 31 Oct 2025, at 7:57, Martin J. Dürst wrote:

On 2025-10-29 09:33, Paul Hoffman wrote:

On Oct 28, 2025, at 01:35, Martin J. Dürst <[email protected]> wrote:

Content, major: Section 3: "There are many Unicode characters that obviously cannot be displayed (such as control characters), and many whose ability to be displayed is debatable.": It's unclear what "many whose ability to be displayed is debatable." means. I'd guess it refers to scripts and characters standardized recently, for which font support is still thin. If that's what is meant, please say so; if something else is meant, please make clear what that is.

There is a wide variety of things that can be debatable. Are combining characters like U+0315 (COMBINING COMMA ABOVE RIGHT) displayable? What about non-spacing marks like U+0650 (ARABIC KASRA)? I am sure people would take each side of the debate ("I can see the symbol printed in the Unicode Standard" vs. "I can't see that code point on my laptop even though it has quite a complete font set" and so on).
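To make the two examples above concrete, here is a minimal sketch (Python, standard-library `unicodedata` only) that looks up the properties of U+0315 and U+0650. The helper name `describe` is purely illustrative. Both code points are non-spacing marks (general category `Mn`), so whether they are "displayable" depends on the base character they attach to and on font support, which these properties cannot tell you.

```python
import unicodedata

def describe(code_point: int) -> dict:
    """Return a few Unicode properties for one code point (illustrative helper)."""
    ch = chr(code_point)
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),      # "Mn" = non-spacing mark
        "combining_class": unicodedata.combining(ch),  # non-zero for combining marks
    }

for cp in (0x0315, 0x0650):
    info = describe(cp)
    print(f"U+{cp:04X} {info['name']} category={info['category']} "
          f"ccc={info['combining_class']}")
```

A non-zero combining class only says the mark attaches to a preceding base character; it says nothing about whether a given terminal, editor, or font will render it.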

On any decent browser, these should display without problems. When it comes to editors, shells, and the like, the field is much wider, so there are no absolute guarantees. But these have been in Unicode since Unicode 1.0 or so, so I would expect them to display.

I will leave it to you and Paul to replace "debatable" with something clearer.



Hi,

There is an entire RFC about this, which Paul co-wrote.

https://www.rfc-editor.org/rfc/rfc9839.html

What you may be missing is that social networks have character counts, and they sure do go after these issues.

These systems do in fact count a "family" emoji as one character, not as multiple characters joined by ZWJs (U+200D ZERO WIDTH JOINER). Once you understand that, it gets a little cleaner.
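As a rough illustration of that counting difference, here is a minimal Python sketch. It is not the twitter-text algorithm and not a full UAX #29 grapheme segmenter; the function name `approx_display_count` and its simplified rules (merge across ZWJ, skip non-spacing and enclosing marks, skip emoji skin-tone modifiers) are assumptions made for this example only.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER, used to join emoji into one glyph

def approx_display_count(text: str) -> int:
    """Approximate count of user-perceived characters (simplified, hypothetical)."""
    count = 0
    joined = False  # True when the previous code point was a ZWJ
    for ch in text:
        if ch == ZWJ:
            joined = True
            continue
        is_extender = (
            unicodedata.category(ch) in ("Mn", "Me")   # combining marks, variation selectors
            or "\U0001F3FB" <= ch <= "\U0001F3FF"      # emoji skin-tone modifiers
        )
        # Count only code points that start a new visible unit.
        if not joined and not is_extender:
            count += 1
        joined = False
    return count

# Family emoji: man, ZWJ, woman, ZWJ, girl, ZWJ, boy
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"
print(len(family))                   # 7 code points
print(approx_display_count(family))  # 1 perceived character
```

A production counter would follow the grapheme cluster rules in Unicode Standard Annex #29 rather than this shortcut, which ignores cases such as regional-indicator flag pairs and Hangul jamo sequences.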

I wrote it:

https://github.com/sayrer/twitter-text/blob/main/rust/parser/src/twitter_text.pest#L344

So, having written code that says:

"// Zombies, genies, dancers, and wrestlers"

I am a little tired of these discussions.

But I have it in a coherent (PEST) grammar. The tough problems are URLs with no protocol and languages that do not require whitespace. So, if you click that link, look at "URL Without Protocol".

I am down (down desu) to really go after this issue, but it is difficult. Mine is the best so far, though.

thanks,
Rob

--
rswg mailing list -- [email protected]
To unsubscribe send an email to [email protected]
