Hello Rob, others,
On 2025-11-03 08:28, Rob Sayre wrote:
On 11/2/25 5:03 AM, Pete Resnick wrote:
On 31 Oct 2025, at 7:57, Martin J. Dürst wrote:
On 2025-10-29 09:33, Paul Hoffman wrote:
On Oct 28, 2025, at 01:35, Martin J. Dürst <[email protected]>
wrote:
Content, major: Section 3: "There are many Unicode characters that
obviously cannot be displayed (such as control characters), and
many whose ability to be displayed is debatable.": It's unclear
what "many whose ability to be displayed is debatable." means. I'd
guess it refers to scripts and characters standardized recently,
for which font support is still thin. If that's what is meant,
please say so; if something else is meant, please make clear what
that is.
There is a wide variety of things that can be debatable. Are
combining characters like U+0315 (COMBINING COMMA ABOVE RIGHT)
displayable? What about non-spacing marks like U+0650 (ARABIC
KASRA)? I am sure people would take each side of the debate ("I can
see the symbol printed in the Unicode Standard" vs. "I can't see
that code point on my laptop even though it has quite a complete
font set" and so on).
On any decent browser, these should display without problems. When it
comes to editors, shells, and the like, the field is much wider, so
there are no absolute guarantees. But these have been in Unicode since
Unicode 1.0 or so, so I would expect them to display.
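One reason such characters invite debate is that both code points
mentioned above are nonspacing marks: they are meant to attach to a
preceding base character rather than stand alone. A minimal Python
sketch, using only the standard unicodedata module, shows the relevant
properties. The helper name describe and the choice of base letters are
assumptions made for illustration only.

```python
import unicodedata

def describe(ch):
    """Return (name, general category, canonical combining class) for one character."""
    return (unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch))

comma_above_right = "\u0315"  # COMBINING COMMA ABOVE RIGHT
kasra = "\u0650"              # ARABIC KASRA

for ch in (comma_above_right, kasra):
    name, category, ccc = describe(ch)
    # Category "Mn" means nonspacing mark; a nonzero combining class
    # means the character combines with a preceding base character.
    print(f"U+{ord(ch):04X} {name}: category={category}, combining class={ccc}")

# Hypothetical samples: each mark attached to a base letter.
latin_sample = "a" + comma_above_right
arabic_sample = "\u0628" + kasra  # ARABIC LETTER BEH followed by KASRA
```

Whether the isolated mark or the combined sample renders legibly still
depends on the font and the rendering engine, which this sketch does not
test.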
I will leave it to you and Paul to replace "debatable" with something
clearer.
Hi,
There is an entire RFC about this, which Paul co-wrote.
https://www.rfc-editor.org/rfc/rfc9839.html
Last time I checked, none of the characters excluded from any of the sets
defined in RFC 9839 had any chance whatsoever of turning up in the names
of people, companies, or places.
What you may be missing is that social networks have character counts,
and they sure do go after these issues.
These systems do in fact count a "family" as one character, not
multiples with ZWJs. Once you understand that, it gets a little cleaner.
I know. At a Unicode Conference many years back, I learned (directly
from the person who initiated that change) that Twitter had switched
from counting bytes to counting code points, which was the first step in
that direction.
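To make the difference between these counting units concrete, here is a
rough Python sketch for a "family" emoji built from three emoji joined
by U+200D (ZERO WIDTH JOINER). The function count_zwj_joined is a
hypothetical helper and a deliberate simplification: it only merges code
points linked by a joiner, and is neither the full grapheme cluster
segmentation of UAX #29 nor the counting rule of any particular social
network.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

# MAN + ZWJ + WOMAN + ZWJ + GIRL
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"

utf8_bytes = len(family.encode("utf-8"))           # storage size in UTF-8
utf16_units = len(family.encode("utf-16-le")) // 2  # UTF-16 code units
code_points = len(family)                           # Python str length counts code points

def count_zwj_joined(text):
    """Count code points, treating anything linked by a ZWJ as one unit.

    Simplified illustration only; real segmentation needs UAX #29 rules.
    """
    count = 0
    joined = False  # True when the previous code point was a ZWJ
    for ch in text:
        if ch == ZWJ:
            joined = True
            continue
        if not joined:
            count += 1
        joined = False
    return count

print(utf8_bytes, utf16_units, code_points, count_zwj_joined(family))
```

The same string therefore has very different "lengths" depending on
whether one counts bytes, code units, code points, or joined sequences.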
But we are currently not looking at writing policy about length
restrictions, so I think this is irrelevant. [It's also irrelevant
because of the low (=zero?) likelihood of somebody having a family
emoji, or any emoji for that matter, in their name.]
Regards, Martin.
I wrote it:
https://github.com/sayrer/twitter-text/blob/main/rust/parser/src/twitter_text.pest#L344
So, having written code that says:
"// Zombies, genies, dancers, and wrestlers"
I am a little tired of these discussions.
But I have it in a coherent (PEST) grammar. The tough problems are URLs
with no protocol and languages that do not require whitespace. So, if
you click that link, look at "URL Without Protocol".
I am down (down desu) to really go after this issue, but it is
difficult. Mine is the best so far, though.
thanks,
Rob
--
rswg mailing list -- [email protected]
To unsubscribe send an email to [email protected]