Re: About regex, charset, and localization.

[email protected] via austin-group-l at The Open Group Fri, 20 Jun 2025 04:24:28 -0700

The WG20 issues have now been transferred to SC35/WG5.
They issued a standard, ISO/IEC 30112, built on POSIX, C and Unicode.


keld

On Fri, Jun 20, 2025 at 08:54:47AM +0000, Niu Danny via austin-group-l at The Open Group wrote:
> Does any of us still have a record of our interactions
> with WG20 (Internationalization)?
> 
> I intend to learn more about the background
> before making any kind of judgment.
> 
> > On 25 May 2025, at 07:58, Niu Danny via austin-group-l at The Open Group
> > <[email protected]> wrote:
> > 
> > I'd like to ask whether the premise below makes the implementation
> > strategy that follows valid. Not sure if it's on topic, though.
> > 
> > Localization in Unix was intended to sell the system to non-English-speaking
> > customers, but nowadays its relevance is decreasing due to the development
> > of language models of deployable scale and improved translation algorithms.
> > Although their accuracy is debated, they're sufficient considering they're
> > primarily just a first-hand, built-in source, and users would purchase more
> > professional translation software or services for work.
> > 
> > Internationalized regex is supposedly a subsidiary tool to localization for
> > text processing, but for a regex engine to be really internationalized, I
> > think a character database model is needed, which isn't easy, as the true
> > boundary of a character is not always clear in every culture. I suppose
> > readers will expect Perl to be mentioned, so yes, a large body of
> > text-processing tools is written in Perl, owing to its more versatile regex
> > and programming language syntax, as well as its diverse ecosystem.
> > 
> > Regex in Unix really is mostly good for system administration, especially
> > for tasks that are meant to be automated, such as log analysis and incident
> > reports. Configuration editing and other tasks that require human decisions,
> > although they cannot be automated, can be greatly augmented when a useful
> > tool such as regex is available to the user.
> > 
> > I personally found another use of regex where localization prevented me
> > from doing what I needed. In web back-end programming, there's a need for
> > "path sanitization" when storing and retrieving files, to prevent a
> > malicious client from using a crafted path to overwrite or access
> > restricted data. Because the regex engine I used at the time was bundled
> > with internationalization support, I had to install an additional
> > dependency during deployment, which wasn't discovered during development.
> > A minor anecdote, though.
> > 
> > POSIX already permits an implementation to support no locales other than
> > the C/POSIX locale, so a regex implementation that has no extension
> > mechanism whatsoever, on a system implementation that doesn't support
> > defining additional locales, is conforming. But here's the part that I'm
> > not sure about:
> > 
> > I want to implement an ASCII-based regex that's simultaneously a byte-based
> > regex. POSIX doesn't require me to use the exact ASCII character set, so in
> > theory I have the freedom to call the byte values 128-255 [:nonchar:] or
> > [:nonascii:] if I see fit. But in this case, strictly speaking, I shouldn't
> > advertise the charset as ASCII in my environment. Yet programs that see
> > ASCII can assume some properties about the environment; wouldn't such
> > assumptions in turn make them strictly non-portable?
> > 
> > How do you view these issues? Thanks for your opinion.
> > 
>
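
To make the byte-based idea in the quoted message concrete, here is a minimal
sketch in Python, which is not the language or engine the original poster used.
The names SAFE_COMPONENT, NONASCII, is_safe_component and has_nonascii are
hypothetical, and the allow-list of characters is only an assumption about what
a "sanitized" path component might permit. The pattern works on raw bytes, so
no locale or character database is consulted, and the byte range 128-255 stands
in for the proposed [:nonascii:] class.

```python
import re

# Hypothetical allow-list for one path component: ASCII letters, digits,
# '.', '_' and '-'. The ranges are spelled out so the match depends only
# on byte values, not on any locale.
SAFE_COMPONENT = re.compile(rb"[A-Za-z0-9._-]+")

# Byte values 128-255, i.e. what the message suggests calling [:nonascii:].
NONASCII = re.compile(rb"[\x80-\xff]")


def is_safe_component(name: bytes) -> bool:
    """Return True if name is a non-empty, allow-listed single component."""
    # Reject the special directory entries outright.
    if name in (b".", b".."):
        return False
    # fullmatch requires every byte to be in the allow-list, so '/' and
    # any byte >= 128 cause rejection.
    return SAFE_COMPONENT.fullmatch(name) is not None


def has_nonascii(data: bytes) -> bool:
    """Return True if data contains any byte in the range 128-255."""
    return NONASCII.search(data) is not None
```

With this sketch, is_safe_component(b"report-2025.txt") is accepted, while
b"..", b"../etc/passwd" and a UTF-8 encoded name such as b"caf\xc3\xa9.txt" are
rejected. Whether rejecting all non-ASCII names is acceptable is a policy
decision that the thread leaves open.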
