Re: new string module

Mauro Tortonesi Mon, 10 Jan 2005 10:50:36 -0800

At 22:45 on Wednesday 5 January 2005, you wrote:
> On Wednesday 05 January 2005 02:46, Jan Minar wrote:
> > What's wrong with mbrtowc(3) and friends?  The mysterious solution is
> > probably to use wprintf(3) instead of printf(3).  A couple of questions on #c
> > on freenode would give you that answer.
>
> Hello everyone,
>
> the entire point of using a fixed internal representation for strings (as
> opposed to strings in a user-defined charset encoding, depending on locale
> settings) was raised by me out of the fear that some gettext() strings could
> legally contain bytes of value 0x00.  Since wget strings use the traditional
> zero-termination approach, this would be a problem, because such strings
> would then appear truncated.
>
> This kind of problem is traditionally solved using a different approach to
> strings, with structs that carry the length in a field (by the way, this
> also tends to be more robust from a security point of view), but this is
> ruled out because it would need a major rewrite of wget, and so would
> mbrtowc. And by the way it could be more of a theoretical problem, because
> I've never actually seen any real locale set to e.g. UTF-16, but then who
> knows?
>
> Anyway, you can force gettext to always convert strings to a fixed charset
> of your choice.  UTF-8 is a common charset encoding powerful enough to
> represent every other charset; it forbids zeroes in the middle of the string
> and represents us-ascii chars exactly the same as us-ascii does, so this
> appears to be the best solution.  Then, just before printing the string out,
> you can transcode it to the current locale.  This way only small portions
> of code would be impacted and the transition would be transparent to the
> rest of the internal machinery.


yes, the problem with gettext is rather easy to fix. theoretically, all you 
have to do is to set up the proper output charset with the  
bind_textdomain_codeset function:

http://www.gnu.org/software/gettext/manual/html_chapter/gettext_10.html#SEC149
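
a minimal sketch of that setup, assuming a glibc-style libintl. the helper name
init_messages_utf8 and the idea of wrapping the calls in one function are
illustrative assumptions, not existing wget code:

```c
#include <libintl.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper (not part of wget): ask gettext to return every
   translated message for DOMAIN encoded as UTF-8, whatever charset the
   user's locale specifies.  Returns 0 on success, -1 on failure. */
static int
init_messages_utf8 (const char *domain, const char *localedir)
{
  setlocale (LC_ALL, "");       /* pick up the user's locale settings */

  if (bindtextdomain (domain, localedir) == NULL)
    return -1;

  /* Force the output charset of gettext() for this domain. */
  if (bind_textdomain_codeset (domain, "UTF-8") == NULL)
    return -1;

  if (textdomain (domain) == NULL)
    return -1;

  return 0;
}
```

after a call such as init_messages_utf8 ("wget", "/usr/share/locale"), strings
returned by gettext() for that domain should arrive in UTF-8 and would need to
be transcoded to the locale charset only at output time.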

now, the real problem comes with the strings received from a server. since at 
the moment we interpret those strings as multibyte characters according to 
the local charset, we may well have null bytes in the middle of a valid 
string received from the server. we cannot pass such a string directly to 
write(2), since 99% of the time it must be interpolated inside another string 
retrieved via gettext and then printed.
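
to illustrate the truncation, here is a small sketch. the UTF-16LE sample and
the helper visible_bytes are made up for the example; the helper reports how
many bytes a zero-terminated consumer (strlen, printf with "%s") would see:

```c
#include <stddef.h>
#include <string.h>

/* "hi" encoded as UTF-16LE: each ASCII character is followed by 0x00. */
static const char utf16le_hi[] = { 'h', 0x00, 'i', 0x00 };

/* Number of bytes a zero-terminated consumer would see in a buffer
   that really holds LEN bytes: everything up to the first 0x00. */
static size_t
visible_bytes (const char *buf, size_t len)
{
  const char *nul = memchr (buf, '\0', len);
  return nul ? (size_t) (nul - buf) : len;
}
```

for the four-byte sample above only the first byte is visible, so
interpolating it with "%s" into a gettext message would silently drop the
rest of the string.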

so, the only solution i can think of is, as simone says, to adopt the UTF-8 
encoding for ALL internal strings in wget and use struct string_t to store 
strings in memory. but we have to do it in a portable way that does not break 
compatibility with older/legacy systems. maybe on systems that don't have 
iconv(3) we can use libiconv or libicu or simply revert to plain ascii 
behaviour (see below).
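
a rough sketch of what a length-carrying string could look like. the field
names and the helpers string_init, string_append and string_free are
assumptions made for illustration; the actual struct string_t may well differ:

```c
#include <stdlib.h>
#include <string.h>

/* Assumed layout, for illustration only. */
struct string_t
{
  char *data;                   /* UTF-8 bytes, may contain 0x00 */
  size_t len;                   /* number of valid bytes in DATA */
};

/* Copy LEN bytes from BYTES into S.  Returns 0 on success, -1 on
   allocation failure. */
static int
string_init (struct string_t *s, const char *bytes, size_t len)
{
  s->data = malloc (len + 1);
  if (s->data == NULL)
    return -1;
  memcpy (s->data, bytes, len);
  s->data[len] = '\0';          /* convenience terminator, not counted */
  s->len = len;
  return 0;
}

/* Append SRC to DST using the stored lengths, never strlen().
   Returns 0 on success, -1 on allocation failure (DST is unchanged). */
static int
string_append (struct string_t *dst, const struct string_t *src)
{
  char *p = realloc (dst->data, dst->len + src->len + 1);
  if (p == NULL)
    return -1;
  memcpy (p + dst->len, src->data, src->len);
  dst->data = p;
  dst->len += src->len;
  dst->data[dst->len] = '\0';
  return 0;
}

static void
string_free (struct string_t *s)
{
  free (s->data);
  s->data = NULL;
  s->len = 0;
}
```

because every operation relies on len rather than on a terminator, a 0x00
byte in the middle of the data survives concatenation intact.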

> I'm not sure if/how this solution impacts portability on old systems (issue
> raised by Dražen Kačar).

i think that on legacy systems (and possibly on windows) we should at least 
support traditional plain ascii encoding. this means that we have to adopt a 
different escape strategy that tries to interpret all input as ascii and not 
as multibyte chars, as the escape_string function based on mbrtowc(3) we have 
now does.
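
one possible shape for such an ascii-only fallback, as a sketch. the name
escape_ascii, its signature and the \xHH output format are assumptions, not
the existing escape_string function:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical ASCII-only escaping: printable ASCII bytes are copied
   through, every other byte (and the backslash itself, to keep the
   output unambiguous) is written as \xHH.  The input is treated as
   raw bytes with an explicit length, so embedded 0x00 is handled.
   Returns the number of bytes written to OUT, not counting the
   terminating zero, or (size_t) -1 if OUT is too small. */
static size_t
escape_ascii (const char *in, size_t inlen, char *out, size_t outsize)
{
  size_t o = 0;

  for (size_t i = 0; i < inlen; i++)
    {
      unsigned char c = (unsigned char) in[i];

      if (c >= 0x20 && c < 0x7f && c != '\\')
        {
          if (o + 1 >= outsize)         /* byte plus terminator */
            return (size_t) -1;
          out[o++] = (char) c;
        }
      else
        {
          if (o + 4 >= outsize)         /* 4 bytes plus terminator */
            return (size_t) -1;
          snprintf (out + o, 5, "\\x%02X", (unsigned int) c);
          o += 4;
        }
    }

  if (o >= outsize)
    return (size_t) -1;
  out[o] = '\0';
  return o;
}
```

this never consults the locale, so it should behave the same on systems
without working multibyte support, at the cost of escaping every non-ascii
byte.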

-- 
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
Institute of Human & Machine Cognition   http://www.ihmc.us
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it
