Localization in Unix was intended to sell the system to non-English-speaking
customers, but nowadays its relevance is decreasing due to the development
of language models of deployable scale and improved translation algorithms.
Although their accuracy is debated, they are sufficient given that they are
primarily just a first-hand built-in source, and users would purchase more
professional translation software or services for work.
TBH I didn't understand. Unix localization isn't relevant because
language models can do translations? Even if the text is translated,
don't you need a system to pick translations?
(...) as the true boundary
of a character is not always clear in every culture.
Isn't this unit a grapheme cluster?
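A small illustration of why "character" boundaries are slippery, using Python here purely as a demonstration (the thread itself is language-agnostic): a single user-perceived character (a grapheme cluster) can consist of several code points, and of even more bytes.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT renders as one character (é),
# but it is two code points and three UTF-8 bytes.
s = "e\u0301"
print(len(s))                   # 2 code points
print(len(s.encode("utf-8")))   # 3 bytes

# NFC normalization collapses it to the single precomposed code point U+00E9.
print(len(unicodedata.normalize("NFC", s)))  # 1
```

So "one character" can mean one grapheme cluster, one code point, or one byte depending on which layer you ask, which is exactly the ambiguity the question points at.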
I want to implement an ASCII-based regex that is simultaneously a byte-based
regex. POSIX doesn't require me to use the exact ASCII character set, so in
theory I have the freedom to call the byte values 128-255 [:nonchar:] or
[:nonascii:] if I see fit.
You don't *have* to restrict yourself to ASCII to have
[:nonchar:]/[:nonascii:].
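A minimal sketch of what such a class would match, with Python's `re` over `bytes` standing in for a POSIX-style byte-based engine (the names [:nonchar:]/[:nonascii:] are the thread's hypothetical classes, not real POSIX ones): every byte with the high bit set is outside ASCII, so a single bracket expression covers the whole class.

```python
import re

# Stand-in for the hypothetical [:nonascii:] class: the bytes 0x80-0xFF,
# i.e. everything outside the 7-bit ASCII range, matched as runs.
NONASCII = re.compile(rb"[\x80-\xff]+")

data = "naïve café".encode("utf-8")
# ï and é each encode as two bytes >= 0x80 in UTF-8, so each shows up
# as one matched run of non-ASCII bytes.
print(NONASCII.findall(data))  # [b'\xc3\xaf', b'\xc3\xa9']
```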
But in that case, strictly speaking, I shouldn't advertise the charset as
ASCII in my environment. Yet programs that see ASCII can assume some
properties about the environment, and wouldn't such an assumption in turn
make them strictly non-portable?
I don't think programs can assume that.
How do you view these issues? Thanks for your opinion.
Go for UTF-8, which is compatible with ASCII, and add 128-255 to that
custom character class. If you ever want to expand portability, UTF-8
will be the way to go.
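The property that makes this advice work, sketched below under the assumption of valid UTF-8 input (the helper name is mine, for illustration): UTF-8 encodes every non-ASCII code point using only bytes in 128-255 and never uses those bytes for ASCII, so a byte class covering 128-255 matches exactly the non-ASCII portions of the text.

```python
def nonascii_byte_spans(data: bytes):
    """Yield (start, end) spans of consecutive bytes >= 0x80.

    In valid UTF-8, these spans are exactly the runs of non-ASCII
    characters: ASCII bytes (0x00-0x7F) never appear inside a
    multi-byte sequence, and vice versa.
    """
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            i += 1
            continue
        j = i
        while j < len(data) and data[j] >= 0x80:
            j += 1
        yield (i, j)
        i = j

data = "grüße".encode("utf-8")
for start, end in nonascii_byte_spans(data):
    # Each span decodes cleanly on its own, because the split points
    # fall only on ASCII/non-ASCII boundaries, never mid-character.
    print(data[start:end].decode("utf-8"))  # "üß"
```

This disjointness between ASCII bytes and multi-byte sequences is why a byte-based regex over UTF-8 stays correct for an ASCII pattern plus a 128-255 class.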