Localization in Unix was intended to sell the system to non-English-speaking
customers, but nowadays its relevance is decreasing due to the development
of language models of deployable scale and improved translation algorithms.
Although their accuracy is debated, they are sufficient given that they are
primarily just a first-hand built-in source, and users would purchase more
professional translation software or services for work.
TBH I didn't understand. Unix localization isn't relevant because
language models can do translations? Even if the text is translated,
don't you need a system to pick translations?
(...) as the true boundary
of a character is not always clear in every culture.
Isn't this unit a grapheme cluster?
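A small illustration of why "character" boundaries are slippery, using Python here purely as a demonstration (the thread itself is language-agnostic): a single user-perceived character (a grapheme cluster) can consist of several code points, and of even more bytes.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT renders as one character (é),
# but it is two code points and three UTF-8 bytes.
s = "e\u0301"
print(len(s))                   # 2 code points
print(len(s.encode("utf-8")))   # 3 bytes

# NFC normalization collapses it to the single precomposed code point U+00E9.
print(len(unicodedata.normalize("NFC", s)))  # 1
```

So "one character" can mean one grapheme cluster, one code point, or one byte depending on which layer you ask, which is exactly the ambiguity the question points at.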
I want to implement an ASCII-based regex that is simultaneously a byte-based
regex. POSIX doesn't require me to use the exact ASCII character set, so in
theory I have the freedom to call the byte values 128-255 [:nonchar:] or
[:nonascii:] if I see fit.
You don't *have* to restrict yourself to ASCII to have
[:nonchar:]/[:nonascii:].
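A minimal sketch of what such a class would match, with Python's `re` over `bytes` standing in for a POSIX-style byte-based engine (the names [:nonchar:]/[:nonascii:] are the thread's hypothetical classes, not real POSIX ones): every byte with the high bit set is outside ASCII, so a single bracket expression covers the whole class.

```python
import re

# Stand-in for the hypothetical [:nonascii:] class: the bytes 0x80-0xFF,
# i.e. everything outside the 7-bit ASCII range, matched as runs.
NONASCII = re.compile(rb"[\x80-\xff]+")

data = "naïve café".encode("utf-8")
# ï and é each encode as two bytes >= 0x80 in UTF-8, so each shows up
# as one matched run of non-ASCII bytes.
print(NONASCII.findall(data))  # [b'\xc3\xaf', b'\xc3\xa9']
```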
But in that case, strictly speaking, I shouldn't advertise the charset as
ASCII in my environment. Yet programs that see ASCII can assume some
properties about the environment, and wouldn't such an assumption in turn
make them strictly non-portable?
I don't think programs can assume that.
How do you view these issues? Thanks for your opinion.
Go for UTF-8, which is compatible with ASCII, and add 128-255 to that
custom character class. If you ever want to expand portability, UTF-8
will be the way to go.
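The property that makes this advice work, sketched below under the assumption of valid UTF-8 input (the helper name is mine, for illustration): UTF-8 encodes every non-ASCII code point using only bytes in 128-255 and never uses those bytes for ASCII, so a byte class covering 128-255 matches exactly the non-ASCII portions of the text.

```python
def nonascii_byte_spans(data: bytes):
    """Yield (start, end) spans of consecutive bytes >= 0x80.

    In valid UTF-8, these spans are exactly the runs of non-ASCII
    characters: ASCII bytes (0x00-0x7F) never appear inside a
    multi-byte sequence, and vice versa.
    """
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            i += 1
            continue
        j = i
        while j < len(data) and data[j] >= 0x80:
            j += 1
        yield (i, j)
        i = j

data = "grüße".encode("utf-8")
for start, end in nonascii_byte_spans(data):
    # Each span decodes cleanly on its own, because the split points
    # fall only on ASCII/non-ASCII boundaries, never mid-character.
    print(data[start:end].decode("utf-8"))  # "üß"
```

This disjointness between ASCII bytes and multi-byte sequences is why a byte-based regex over UTF-8 stays correct for an ASCII pattern plus a 128-255 class.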