Does any of us still have a record of our interactions with WG20 - Internationalization?
I intend to learn more about the background before making any kind of judgment.

> On May 25, 2025, at 07:58, Niu Danny via austin-group-l at The Open Group <[email protected]> wrote:
>
> I'd like to ask whether the following premise makes the following
> implementation strategy valid. Not sure if it's on topic though.
>
> Localization in Unix was intended to sell the system to non-English-speaking
> customers, but nowadays its relevance is decreasing due to the development
> of language models of deployable scale and improved translation algorithms.
> Although their accuracy is debated, they're sufficient considering they're
> primarily just a first-hand built-in source, and users would purchase more
> professional translation software or services for work.
>
> Internationalized regex is supposedly a subsidiary tool to localization for
> text processing, but for a regex engine to be really internationalized, I
> think a character database model is needed, which isn't easy, as the true
> boundary of a character is not always clear in every culture. I suppose
> readers will expect Perl to be mentioned, so yes, a large body of text
> processing tools is written in Perl, owing to its more versatile regex and
> programming language syntax, as well as its diverse ecosystem.
>
> Regex in Unix really is mostly good for system administration, especially for
> tasks that are meant to be automated, such as log analysis and incident
> reports. Configuration editing and other tasks that require human decisions,
> although they cannot be automated, can be greatly augmented when a useful
> tool such as regex is available to the user.
>
> I personally found another use of regex where localization prevented me from
> doing what I needed. In web back-end programming, there's the need for
> "path sanitization" when storing and retrieving files, to prevent a malicious
> client from using a crafted path to overwrite or access restricted data.
> Because the regex engine I used at the time was bundled with
> internationalization support, I had to install an additional dependency
> during deployment, which wasn't discovered during development. Minor
> anecdote though.
>
> POSIX already permits an implementation to support no locales other than
> the C/POSIX locale, so a regex implementation that has no extension mechanism
> whatsoever, on a system implementation that doesn't support defining
> additional locales, is conforming. But here's the part that I'm not sure about:
>
> I want to implement an ASCII-based regex that's simultaneously a byte-based
> regex. POSIX doesn't require me to use the exact ASCII character set, so in
> theory I have the freedom to call the byte values 128-255 [:nonchar:] or
> [:nonascii:] if I see fit. But in this case, I strictly shouldn't advertise
> the charset as ASCII in my environment. Yet programs that see ASCII can
> assume some properties about the environment, but wouldn't such assumptions
> in turn make them strictly non-portable?
>
> How do you view these issues? Thanks for your opinion.
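
To make the byte-based idea in the quoted message concrete, here is a minimal sketch, written in Python purely for illustration. It is not a POSIX regex implementation, and every name in it (SAFE_COMPONENT, NONASCII, is_safe_component, count_nonascii) is hypothetical. The allow-list of file name characters is my own assumption, not something POSIX or the original message specifies. The pattern uses explicit byte ranges only, so matching depends on byte values and not on any locale.

```python
import re

# Assumed allow-list for one path component: ASCII letters, digits,
# dot, underscore, hyphen. The pattern is a bytes pattern built from
# explicit ranges, so it compares raw byte values only.
SAFE_COMPONENT = re.compile(rb"[A-Za-z0-9._-]+")

# Stand-in for the "[:nonascii:]" class floated in the quoted message:
# any byte with a value from 128 to 255.
NONASCII = re.compile(rb"[\x80-\xff]")


def is_safe_component(name: bytes) -> bool:
    """Accept a single file name; reject '.', '..', and off-list bytes."""
    if name in (b".", b".."):
        return False
    return SAFE_COMPONENT.fullmatch(name) is not None


def count_nonascii(data: bytes) -> int:
    """Count the bytes whose value lies in the range 128-255."""
    return len(NONASCII.findall(data))
```

With this sketch, a name such as b"report-2025.txt" is accepted, while b"../etc/passwd" is rejected because "/" is not on the allow-list, and any UTF-8 encoded non-ASCII name is rejected because its bytes fall in the 128-255 range.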
