Re: Can we print UTF-8 chars in Wx::TextCtrl fields?

Octavian Rasnita Mon, 29 Apr 2013 11:57:20 -0700

Hi Mark,

Thank you for this great explanation. Much clearer than other documentation.


--Octavian

----- Original Message -----
From: "Mark Dootson" <mark.doot...@znix.com>
To: <steveco.1...@gmail.com>; <wxperl-users@perl.org>
Sent: Monday, April 29, 2013 4:32 PM
Subject: Re: Can we print UTF-8 chars in Wx::TextCtrl fields?

Hi,
A Perl scalar has a character buffer to store character or byte data. This data can be interpreted and stored by Perl in one of two formats:
1. Perl's internal data format
2. A number of octets (bytes) representing a UTF-8 encoded string.
Internally it is just a memory buffer. Each scalar has a utf8 flag. This tells Perl internally how to interpret its data buffer: either as Perl's internal data format or as UTF-8 encoded text. If the utf8 flag is on, Perl regards the buffer as UTF-8 encoded text. If the utf8 flag is off, Perl regards the buffer as containing data in Perl's internal format.
So, say I load some binary data that I know is text encoded using 'ISO-8859-1'.
Then I would do:

use Encode qw(decode);   # decode() comes from the core Encode module
my $string = decode('ISO-8859-1', $binary);
This gets $string which contains data in Perl's internal format. The utf8 flag for the scalar '$string' is off. As you have noted below, I can't pass '$binary' to any of Perl's string functions. The results will be unpredictable and mostly bad.
The evil starts due to some special features when we use decode to convert a UTF-8 encoded string.
my $string = decode('utf8', $binary);
If $binary can be converted to $string using single byte characters, then $string will be in Perl's internal data format and marked as such (utf8 flag off). If $binary contains multiple byte characters, then $string will contain a series of bytes representing a UTF-8 encoded string and the scalar '$string' will have the utf8 flag on.
Within Perl it should not matter whether the scalar is marked UTF-8 or not, so long as the utf8 flag correctly reflects what's in the scalar's data buffer.
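
A minimal sketch of how the flag can be inspected, in case it helps to see it. The sample bytes in $binary are made up for illustration; decode() comes from the core Encode module and utf8::is_utf8 is always available.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample input: the UTF-8 bytes for U+00E9, a space, and U+20AC.
my $binary = "\xC3\xA9 \xE2\x82\xAC";

my $string = decode('UTF-8', $binary);

# length() counts bytes on the raw data and characters on the decoded string.
printf "bytes: %d, characters: %d\n", length($binary), length($string);

# utf8::is_utf8 reports the state of the scalar's utf8 flag.
printf "flag on \$binary: %s\n", utf8::is_utf8($binary) ? 'on' : 'off';
printf "flag on \$string: %s\n", utf8::is_utf8($string) ? 'on' : 'off';
```

Here the decoded string contains a character above 0xFF, so the flag has to be on; for input that fits in single bytes the flag state is the part that cannot be relied on.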
The problem comes when we come to pass the data to the wxWidgets library.

The source macro that does this is:

#define WXSTRING_INPUT( var, type, arg ) \
  var =  ( SvUTF8( arg ) ) ? \
           wxString( SvPVutf8_nolen( arg ), wxConvUTF8 ) \
         : wxString( SvPV_nolen( arg ), wxConvLibc );
So basically, if the scalar is marked as 'utf8' then it gets converted into a wxString as such. If not, you're at the mercy of libc and local system settings. It may work. It may not.
Solution - your conversion of external data should be

 my $string = decode($encoding, $binary);
 utf8::upgrade($string);
This should be platform independent and work - always. Perl's string functions should all work OK on $string.
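
The two-step conversion could be wrapped in a small helper, roughly like this. The name decode_for_wx and the sample byte strings are hypothetical, chosen only for this sketch.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode external bytes, then force the utf8 flag on.
sub decode_for_wx {
    my ($encoding, $binary) = @_;
    my $string = decode($encoding, $binary);
    utf8::upgrade($string);    # works in place; the flag is on afterwards
    return $string;
}

my $from_latin1 = decode_for_wx('ISO-8859-1', "caf\xE9");      # Latin-1 bytes
my $from_utf8   = decode_for_wx('UTF-8',      "caf\xC3\xA9");  # UTF-8 bytes

# Both scalars hold the same four characters and both carry the flag.
print "same text\n" if $from_latin1 eq $from_utf8;
print "both flagged\n"
    if utf8::is_utf8($from_latin1) && utf8::is_utf8($from_utf8);
```

A string returned by such a helper is what you would then hand to a Wx method that takes text, for example the SetValue method of a Wx::TextCtrl.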
The key points

my $string = decode('utf8', $binary);
It depends on the content of $binary whether $string has the utf8 flag set.
my $string = decode('utf8', $binary);
utf8::upgrade( $string );
$string always has utf8 flag set. You could just do utf8::upgrade($binary) but that would be a special case for when $binary actually contains UTF-8 bytes. The two step method applies to any encoding.
Perl can't know that a scalar contains UTF-8 encoded text unless you tell it.
The statement:
'use utf8;' is not needed anywhere here of course, as it indicates that the source code is encoded in UTF-8. Nothing more. Functions utf8::upgrade etc. are always available.
If you have a list of scalars containing strings as in

@combo_options

then the same applies - to each individual scalar / string in the list.
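
For a list, that might look something like the following sketch. The contents of @raw_options are invented sample data, standing in for UTF-8 bytes read from a file or database.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical raw option labels as UTF-8 encoded bytes.
my @raw_options = ("Z\xC3\xBCrich", "Gen\xC3\xA8ve", "Bern");

# Decode and upgrade each scalar individually.
my @combo_options = map {
    my $s = decode('UTF-8', $_);
    utf8::upgrade($s);
    $s;
} @raw_options;

# Every element now carries the utf8 flag, including the plain ASCII one.
my @unflagged = grep { !utf8::is_utf8($_) } @combo_options;
printf "%d options, %d without the flag\n",
    scalar @combo_options, scalar @unflagged;
```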


Hope it helps.


Mark


On 29/04/2013 12:29, steveco.1...@gmail.com wrote:
Hi Mark, I'm a relative newcomer to utf8 so please take everything I say
with a pinch of salt but your answer looks a bit qualified: if scalar, if
marked. That implies if I want a Perl list (say @combo_options) for a
Wx::ComboBox, then that won't work? Is that how it is?

And I don't know what marked means.
The real problem for me is that this feels like the wrong place to decode.
There are lots of things I might want to do with a string before I display
it.  I might want to sort it, or trim white space, or substitute a
place-marker with a value. And for these I need it to be decoded before I
process it.  If I have a very simple app with no string processing, then
this approach would be great but not otherwise.

I did have a lot of issues with utf8 at the beginning: sometimes I had
display issues with a utf8 string and sometimes not. There seemed to be no
particular rhyme or reason to it. And as you say, it works differently on
Windows and Linux. Finally, everything is very sensitive to small errors,
like having a non-existent style code.
So I use a policy which is that when I read a value into my program from a
database or a file, I always decode immediately. That way I know that all
my variables are decoded and processable.

Then I encode before I write back to the file or db.
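
As a rough sketch of that policy (decode on the way in, process, encode on the way out), assuming the stored data is UTF-8. The value in $raw is a made-up example, not real data from my database.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical value as it might arrive from a file or database:
# UTF-8 bytes with surrounding white space.
my $raw = "  \xC3\x89cole  ";

my $text = decode('UTF-8', $raw);    # decode immediately on input
$text =~ s/^\s+|\s+$//g;             # trim white space
my $lower = lc $text;                # string functions now work per character

my $out = encode('UTF-8', $lower);   # encode just before writing back
printf "%d characters in, %d bytes out\n", length($text), length($out);
```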

If I have an issue now, it is always where I have not done a:

$var = decode("utf8",$row->{ATT_BOOKING_COMMENT_TXT}) ;

Anyhow let me know what you think.

Regards

Steve
