Re: About regex, charset, and localization.

Niu Danny via austin-group-l at The Open Group Fri, 20 Jun 2025 02:07:03 -0700

Does any of us still have a record of our interactions
with WG20 (Internationalization)?


I intend to learn more about the background
before making any kind of judgment.

> On 25 May 2025, at 07:58, Niu Danny via austin-group-l at The Open Group
> <[email protected]> wrote:
> 
> I'd like to ask whether the following premises make the following
> implementation strategy valid. I'm not sure it's on topic, though.
> 
> Localization in Unix was intended to sell the system to non-English-speaking
> customers, but nowadays its relevance is decreasing due to the development
> of language models of deployable scale and improved translation algorithms.
> Although their accuracy is debated, they're sufficient considering they're
> primarily just a first-hand, built-in source, and users would purchase more
> professional translation software or services for work.
> 
> Internationalized regex is supposedly a subsidiary tool to localization for
> text processing, but for a regex engine to be really internationalized, I
> think a character database model is needed, which is not easy, as the true
> boundary of a character is not always clear in every culture. I suppose
> readers will expect Perl to be mentioned, so yes, a large codebase of text
> processing tools is written in Perl, owing to its more versatile regex and
> programming language syntax, as well as its diverse ecosystem.
> 
> Regex in Unix really is mostly good for system administration, especially for
> tasks that are meant to be automated, such as log analysis and incident
> reports. Configuration editing and other tasks that require human decisions,
> although they cannot be automated, can be greatly augmented when a useful
> tool such as regex is available to the user.
> 
> I personally found another use of regex where localization prevented me from
> doing what I needed. In web back-end programming, there's a need for
> "path sanitization" when storing and retrieving files, to prevent a malicious
> client from using a crafted path to overwrite or access restricted data.
> Because the regex engine I used at the time was bundled with
> internationalization support, I had to install an additional dependency
> during deployment, which wasn't discovered during development. A minor
> anecdote, though.
> 
> POSIX already permits an implementation to support no locales other than
> the C/POSIX locale, so a regex implementation that has no extension mechanism
> whatsoever, on a system that doesn't support defining additional locales, is
> conforming. But here's the part that I'm not sure about:
> 
> I want to implement an ASCII-based regex that's simultaneously a byte-based
> regex. POSIX doesn't require me to use the exact ASCII character set, so in
> theory I have the freedom to call the byte values 128-255 [:nonchar:] or
> [:nonascii:] if I see fit. But in this case, strictly speaking, I shouldn't
> advertise the charset as ASCII in my environment. Yet programs that see ASCII
> can assume some properties about the environment, but won't such assumptions
> in turn make them strictly non-portable?
> 
> How do you view these issues? Thanks for your opinion.
>
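
For illustration only, the byte-based approach described in the quoted message
can be sketched in Python, whose `re` module compares raw byte values when it is
given a bytes pattern with explicit ranges. The names SAFE_COMPONENT, NONASCII,
is_safe_relative_path and has_nonascii, as well as the choice of allowed
characters, are assumptions made for this sketch. They do not come from the
original post, and this is not a POSIX regex implementation.

```python
import re

# Hypothetical whitelist for one path component: ASCII letters, digits,
# '.', '_' and '-', with no leading dot. Because this is a bytes pattern
# with explicit ranges, matching depends only on byte values.
SAFE_COMPONENT = re.compile(rb"[A-Za-z0-9_-][A-Za-z0-9._-]*")

# Byte values 128-255, the range the post suggests naming [:nonascii:].
NONASCII = re.compile(rb"[\x80-\xff]")


def is_safe_relative_path(path: bytes) -> bool:
    """Return True if every '/'-separated component passes the whitelist."""
    if not path:
        return False
    # An empty component (leading '/', trailing '/', or '//') fails fullmatch.
    return all(SAFE_COMPONENT.fullmatch(part) for part in path.split(b"/"))


def has_nonascii(data: bytes) -> bool:
    """Return True if any byte lies in the range 128-255."""
    return NONASCII.search(data) is not None
```

Under these assumptions, is_safe_relative_path(b"uploads/report-2025.txt") is
accepted, while b"../etc/passwd" and b"/etc/passwd" are rejected. A real
deployment would likely need further checks, such as length limits and
resolving the result against a fixed base directory.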
