Re: perl unicode support

Rich Felker Thu, 29 Mar 2007 09:15:49 -0800

On Thu, Mar 29, 2007 at 12:24:43PM +0200, Egmont Koblinger wrote:
> On Wed, Mar 28, 2007 at 05:57:35PM -0400, ＳｒｉｎＴｕａｒ wrote:
> 
> > The regex library can ask the locale what encoding things are in, just
> > like everybody else
> 
> The locale tells you which encoding your system uses _by default_. This is
> not necessarily the same as the data you're currently working with.


The word “default” does not appear in any standard regarding LC_CTYPE.
It determines THE encoding of text. Foreign character data from other
systems obviously cannot be treated directly as text under this view.
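
A minimal sketch of asking the locale for that encoding, in Python
rather than C. The helper name locale_codeset is made up for this
example, and nl_langinfo is only available on Unix-like systems:

```python
import locale

def locale_codeset():
    """Return the character encoding named by the LC_CTYPE category."""
    try:
        # Adopt LC_CTYPE from the environment (LC_ALL / LC_CTYPE / LANG).
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale that is not installed;
        # keep whatever locale is currently active.
        pass
    # nl_langinfo(CODESET) is the POSIX query for the locale's codeset.
    return locale.nl_langinfo(locale.CODESET)

print(locale_codeset())  # e.g. UTF-8 under a UTF-8 locale
```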

> write a console mp3 id3v2 editor if you completely ignored the console's
> charset

The console deals in text, and text is encoded according to
LC_CTYPE. The tags are encoded according to the encoding specified by
the file and may be converted via iconv or similar library calls.
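
As a rough illustration of that conversion step (a Python sketch, not
code from any real tag editor): the function name and the sample frame
are hypothetical, and the table assumes the ID3v2.4 convention in which
the first byte of a text frame names its encoding.

```python
# Encoding byte -> Python codec name (assumed ID3v2.4 text-frame convention).
ID3V2_CODECS = {0: "latin-1", 1: "utf-16", 2: "utf-16-be", 3: "utf-8"}

def tag_text_for_console(frame_body, console_codec):
    """Convert one text-frame body to bytes in the console's encoding."""
    codec = ID3V2_CODECS[frame_body[0]]    # first byte names the encoding
    text = frame_body[1:].decode(codec)    # foreign bytes -> text
    text = text.rstrip("\x00")             # drop an optional terminator
    # text -> bytes in the locale's encoding, much as iconv would do
    return text.encode(console_codec, errors="replace")

frame = b"\x03" + "Dvořák".encode("utf-8")
print(tag_text_for_console(frame, "iso-8859-2"))
```

The rest of the program only ever sees the converted text; the tag's
own encoding does not leak past this one function.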

> or the charset used within the id3v2 tags? How would you write a
> database frontend if you completely ignored the local charset as well as the
> charset used in the database? (Someone inserts some data, someone else
> queries it and receives different letters...)

The same problem exists on the filesystem. The solution locally is to
mandate a policy of a single encoding for all users sharing data. For
remote protocols, the protocol usually specifies an encoding by which
the data is delivered, so again you convert with iconv or a
similar library.

Neither SrinTuar nor I have said that encoding is always
something you can ignore. My point is that consideration of it can be
fully isolated to the point at which badly-encoded data is received
(from text embedded in a binary file, from http, from mime mail, etc.)
such that the other 99% of your software never has to think about it.
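
For MIME mail, that isolation might look like the following Python
sketch. The function name text_from_mime and the sample message are
invented for illustration; the decoding itself is done by the standard
email package, which honours the charset declared in the headers.

```python
from email import message_from_bytes, policy

def text_from_mime(raw_message):
    """Decode a single-part text message once, at the point of receipt."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    # get_content() decodes the body using the charset from Content-Type.
    return msg.get_content()

raw = (
    b"Content-Type: text/plain; charset=iso-8859-2\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    + "Dvořák".encode("iso-8859-2")
    + b"\r\n"
)
body = text_from_mime(raw)
print(body)
```

Everything downstream of text_from_mime works with text and never needs
to know that the message arrived as iso-8859-2.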

> > >There _are_ many character sets out there, and it's _your_ job, the
> > >programmer's job to tell the compiler/interpreter how to handle your bytes
> > >and to hide all these charset issues from the users. Therefore you have to
> > >be aware of the technical issues and have to be able to handle them.
> > 
> > If that was true then the vast majority of programs would not be i18n'd..
> 
> That's false. Check for example the bind_textdomain_codeset call. In Gtk+-2
> apps you call it with an UTF-8 argument. This happens because you _know_
> that you'll need this data encoded in UTF-8.

Then what do you do when you want to print text to stdout, or generate
filenames, etc.? You can’t use your localized text anymore because the
encoding may not match. This is evidence that gtk’s approach is
flawed.

> > I wish perl would let me do that- it works so well in C.
> 
> I already wrote twice. Just in case you haven't seen it, I write it for the
> third time. Perl _lets_ you think/work in bytes. Just ignore everything
> related to UTF-8. Just never set the utf8 mode. You'll be back at the world
> of bytes. It's so simple!

I don’t know about SrinTuar, but this is not what I meant at all. I
want (NEED!) regex to work correctly, etc. Thus Perl needs to respect
the character encoding, which thankfully matches the host encoding,
UTF-8. No problem so far. However, as soon as I try to send these Perl
character strings (which are equally valid as host character strings)
to stdout, it spews warnings, and does so in an inconsistent way!
(i.e. it complains about characters above 255 but not characters
128-255)
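
The asymmetry can be modelled roughly as follows. This is a Python toy,
not Perl's implementation: it assumes a filehandle with no encoding
layer, which writes a string silently as Latin-1 when every character
fits in one byte and otherwise writes UTF-8 with a "Wide character"
warning. The name model_print is made up for this sketch.

```python
def model_print(text):
    """Toy model of the output behaviour; returns (bytes_written, warned)."""
    if all(ord(ch) < 256 for ch in text):
        # Every character fits in one byte: written as Latin-1, silently,
        # even though bytes 128-255 are not valid UTF-8 on their own.
        return text.encode("latin-1"), False
    # Some character is above 255: written as UTF-8, with a warning.
    return text.encode("utf-8"), True

for sample in ("abc", "\u00e9", "\u263a"):
    written, warned = model_print(sample)
    print(repr(sample), written, "warning" if warned else "silent")
```

Under this model the 128-255 case is the worse one on a UTF-8 terminal:
it emits bytes that are invalid in the host encoding and says nothing.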

> > Their internal utf-16 mandate was a mistake, imo.
> 
> That was not utf-16 but ucs-2 at that time and imo those days it was a
> perfectly reasonable decision.

It was not. UCS-2 was already obsolete at the time Java was released
to the public in 1995. UTF-8 was invented in September 1992.

> > (and the locale should always say utf-8)
> 
> Should, but doesn't. It's your choice to decide whether you want your
> application to work everywhere, or only under utf-8 locales.

Having limited functionality (displaying ??? for all characters not
available in the locale) under broken legacy locales is perfectly
acceptable behavior. If someone wants to use/display/write a
character, they need to use a character encoding where that character
is encoded!!!
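
A small Python sketch of that degraded mode (the function name
for_legacy_locale is hypothetical): characters the locale's encoding
cannot represent come out as question marks, and everything else is
converted normally.

```python
def for_legacy_locale(text, codec):
    """Encode text for a legacy locale, replacing unencodable characters."""
    return text.encode(codec, errors="replace")

# iso-8859-2 covers the Czech letters but not the smiley.
print(for_legacy_locale("Dvořák ☺", "iso-8859-2"))
# ASCII covers none of the three non-ASCII characters.
print(for_legacy_locale("Dvořák ☺", "ascii"))  # b'Dvo??k ?'
```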

> I admit that in an ideal world everything would be encoded in UTF-8. Just
> don't forget: our world is not ideal. My browser has to display web pages
> encoded in Windows-1250 correctly. My e-mail client has to display messages
> encoded in iso-8859-2 correctly. And so on...

As you can read above, none of this is contrary to what I said. My
system does all of this quite well.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
