On 11/2/25 5:03 AM, Pete Resnick wrote:
On 31 Oct 2025, at 7:57, Martin J. Dürst wrote:
On 2025-10-29 09:33, Paul Hoffman wrote:
On Oct 28, 2025, at 01:35, Martin J. Dürst wrote:
Content, major: Section 3: "There are many Unicode characters that
obviously cannot be displayed (such as control characters), and many
whose ability to be displayed is debatable.": It's unclear what
"many whose ability to be displayed is debatable." means. I'd guess
it refers to scripts and characters standardized recently, for which
font support is still thin. If that's what is meant, please say so;
if something else is meant, please make clear what that is.
There is a wide variety of things that can be debatable. Are
combining characters like U+0315 (COMBINING COMMA ABOVE RIGHT)
displayable? What about non-spacing marks like U+0650 (ARABIC KASRA)?
I am sure people would take each side of the debate ("I can see the
symbol printed in the Unicode Standard" vs. "I can't see that code
point on my laptop even though it has quite a complete font set" and
so on).
On any decent browser, these should display without problems. When it
comes to editors, shells, and the like, the field is much wider, so
there are no absolute guarantees. But these have been in Unicode since
Unicode 1.0 or so, so I would expect them to display.
I will leave it to you and Paul to replace "debatable" with something
clearer.
Hi,
There is an entire RFC about this, which Paul co-wrote.
https://www.rfc-editor.org/rfc/rfc9839.html
What you may be missing is that social networks have character counts,
and they sure do go after these issues.
These systems do in fact count a "family" as one character, not as
multiple characters joined with ZWJs. Once you understand that, it gets a
little cleaner.
I wrote it:
https://github.com/sayrer/twitter-text/blob/main/rust/parser/src/twitter_text.pest#L344
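To make the counting difference concrete, here is a minimal Python sketch.
It is not the grammar linked above and not a full grapheme-cluster
segmenter; it assumes a simplified rule in which a ZERO WIDTH JOINER
(U+200D) glues its neighbors into one unit, and nonspacing marks,
variation selectors, and emoji skin-tone modifiers attach to the preceding
character. The names `is_extender` and `count_display_units` are made up
for illustration.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def is_extender(ch: str) -> bool:
    """True if ch attaches to the previous character (simplified rule)."""
    # Mn/Me covers nonspacing and enclosing marks, including variation
    # selectors; U+1F3FB..U+1F3FF are the emoji skin-tone modifiers.
    return (
        unicodedata.category(ch) in ("Mn", "Me")
        or 0x1F3FB <= ord(ch) <= 0x1F3FF
    )


def count_display_units(text: str) -> int:
    """Count user-visible units under the simplified assumptions above."""
    count = 0
    join_next = False  # True right after a ZWJ that follows a counted unit
    for ch in text:
        if ch == ZWJ:
            join_next = count > 0
            continue
        if is_extender(ch) and count > 0:
            join_next = False
            continue
        if join_next:
            # This character is glued to the current unit by the ZWJ.
            join_next = False
            continue
        count += 1
    return count


# MAN, ZWJ, WOMAN, ZWJ, GIRL: one "family" emoji built from five code points.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"
print(len(family), count_display_units(family))
```

A production counter would follow the extended grapheme cluster rules in
Unicode Standard Annex #29 rather than this shortcut.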
So, having written code that says:
"// Zombies, genies, dancers, and wrestlers"
I am a little tired of these discussions.
But I have it in a coherent (PEST) grammar. The tough problems are URLs
with no protocol and languages that do not require whitespace. So, if
you click that link, look at "URL Without Protocol".
I am down (down desu) to really go after this issue, but it is
difficult. Mine is the best so far, though.
thanks,
Rob