Hi Mark,

Thank you for this great explanation. It is much clearer than the other documentation.

--Octavian

----- Original Message -----
From: "Mark Dootson" <mark.doot...@znix.com>
To: <steveco.1...@gmail.com>; <wxperl-users@perl.org>
Sent: Monday, April 29, 2013 4:32 PM
Subject: Re: Can we print UTF-8 chars in Wx::TextCtrl fields?


Hi,

A Perl scalar has a character buffer to store character or byte data. This data can be interpreted and stored by Perl in one of two formats:

1. Perl's internal data format
2. A number of octets (bytes) representing a UTF-8 encoded string.

Internally it is just a memory buffer. Each scalar has a utf8 flag. This tells Perl internally how to interpret its data buffer. Either as Perl's internal data format or as UTF-8 encoded text. If the utf8 flag is on, Perl regards the buffer as UTF-8 encoded text. If the utf8 flag is off, Perl regards the buffer as containing data in Perl's internal format.

So, say I load some binary data that I know is text encoded using 'ISO-8859-1'.

Then I would do:

my $string = decode('ISO-8859-1', $binary);

This gets $string, which contains data in Perl's internal format. The utf8 flag for the scalar '$string' is off. As you have noted below, I can't pass '$binary' to any of Perl's string functions. The results will be unpredictable and mostly bad.

The evil starts due to some special features when we use decode to convert a UTF-8 encoded string.

my $string = decode('utf8', $binary);

If $binary can be converted to $string using single byte characters, then $string will be in Perl's internal data format and marked as such (utf8 flag off). If $binary contains multiple byte characters, then $string will contain a series of bytes representing a UTF-8 encoded string and the scalar '$string' will have the utf8 flag on.

Within Perl it should not matter whether the scalar is marked UTF-8 or not - so long as the utf8 flag correctly reflects what's in the scalar's data buffer.

The problem comes when we come to pass the data to the wxWidgets library.

The source macro that does this is:

#define WXSTRING_INPUT( var, type, arg ) \
  var =  ( SvUTF8( arg ) ) ? \
           wxString( SvPVutf8_nolen( arg ), wxConvUTF8 ) \
         : wxString( SvPV_nolen( arg ), wxConvLibc );


So basically, if the scalar is marked as 'utf8' then it gets converted into a wxString as such. If not, you're at the mercy of libc and local system settings. It may work. It may not.


Solution - your conversion of external data should be

 my $string = decode($encoding, $binary);
 utf8::upgrade($string);

This should be platform independent and work - always. Perl's string functions should all work OK on $string.
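As a minimal sketch of the two steps above (the sample bytes and the variable names are made up for illustration, and ISO-8859-1 simply stands in for whatever $encoding your data really uses):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: the four bytes of "café" as encoded in ISO-8859-1.
my $binary = "caf\xE9";

# Step 1: convert the raw bytes into a Perl character string.
my $string = decode('ISO-8859-1', $binary);

# Step 2: force the utf8 flag on, whatever the string happens to contain.
utf8::upgrade($string);

# Report the character count and the state of the flag.
printf "characters: %d, utf8 flag: %s\n",
    length($string), utf8::is_utf8($string) ? 'on' : 'off';
```

utf8::is_utf8 is only used here to inspect the flag. Normal code should not need to call it.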

The key points

my $string = decode('utf8', $binary);

It depends on the content of $binary whether $string has the utf8 flag set.

my $string = decode('utf8', $binary);
utf8::upgrade( $string );

$string always has utf8 flag set. You could just do utf8::upgrade($binary) but that would be a special case for when $binary actually contains UTF-8 bytes. The two step method applies to any encoding.

Perl can't know that a scalar contains UTF-8 encoded text unless you tell it.

The statement 'use utf8;' is not needed anywhere here, of course, as it only indicates that the source code is encoded in UTF-8. Nothing more. Functions such as utf8::upgrade are always available.

If you have a list of scalars containing strings as in

@combo_options

then the same applies - to each individual scalar / string in the list.
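For a list, a rough sketch could look like this. The @raw_options array and its contents are invented for the example, and I am assuming the raw data is UTF-8 encoded:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical raw labels, as UTF-8 encoded bytes read from a file or database.
my @raw_options = ("Z\xC3\xBCrich", "Paris", "K\xC3\xB8benhavn");

# Decode and upgrade every element individually.
my @combo_options = map {
    my $bytes = $_;                      # work on a copy, leave the raw list untouched
    my $s = decode('UTF-8', $bytes);     # bytes -> characters
    utf8::upgrade($s);                   # utf8 flag on for every element
    $s;
} @raw_options;

printf "%d options prepared\n", scalar @combo_options;
```

The resulting @combo_options is what you would then hand to the control, for example as the choices of a Wx::ComboBox.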


Hope it helps.


Mark


On 29/04/2013 12:29, steveco.1...@gmail.com wrote:

Hi Mark, I'm a relative newcomer to utf8 so please take everything I say
with a pinch of salt but your answer looks a bit qualified: if scalar, if
marked. That implies if I want a Perl list (say @combo_options) for a
Wx::ComboBox, then that won't work? Is that how it is?

And I don't know what marked means.

The real problem for me is that this feels like the wrong place to decode. There are lots of things I might want to do with a string before I display
it.  I might want to sort it, or trim white space, or substitute a
place-marker with a value. And for these I need it to be decoded before I
process it.  If I have a very simple app with no string processing, then
this approach would be great but not otherwise.

I did have a lot of issues with utf8 at the beginning: sometimes I had
display issues with a utf8 string and sometimes not. There seemed to be no
particular rhyme or reason to it. And as you say, it works differently on
Windows and Linux. Finally, everything is very sensitive to small errors,
like having a non-existent style code.

So I use a policy which is that when I read a value into my program from a database or a file, I always decode immediately. That way I know that all
my variables are decoded and processable.

Then I encode before I write back to the file or db.

If I have an issue now, it is always where I have not done a:

$var = decode("utf8",$row->{ATT_BOOKING_COMMENT_TXT}) ;
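A small sketch of that decode-early, encode-late policy, assuming the stored data is UTF-8 (the sample value and the variable names are hypothetical, not taken from the real database):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical value as it arrives from the database: UTF-8 bytes with padding.
my $bytes_in = "  caf\xC3\xA9  ";

# Decode immediately on input so all processing works on characters.
my $text = decode('UTF-8', $bytes_in);

# Ordinary string processing, here trimming leading and trailing white space.
$text =~ s/^\s+|\s+$//g;

# Encode again only when writing back to the file or database.
my $bytes_out = encode('UTF-8', $text);

printf "%d characters in, %d bytes out\n", length($text), length($bytes_out);
```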

Anyhow let me know what you think.

Regards

Steve


