Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-17 Thread Marvin Humphrey
ice.  The first time, though, I get
a "wide character in print" warning.  That warning arises because Perl's
STDOUT is set to Latin-1 by default.  It wants to "downgrade" the UTF8 scalar
to Latin-1, but it can't do so without loss, so it warns and outputs the bytes
as is.  After we change STDOUT to 'utf8', the warning goes away.
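
A minimal sketch of the behavior described above, under some assumptions: it prints to an in-memory filehandle instead of STDOUT so the bytes and the warning can be captured, and `print_and_capture` is a hypothetical helper name, not something from the original thread.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text to an in-memory filehandle, optionally
# with an I/O layer, and return the bytes written plus any warnings raised.
sub print_and_capture {
    my ($text, $layer) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $buffer = '';
    open my $fh, '>', \$buffer or die "cannot open in-memory handle: $!";
    binmode $fh, $layer if defined $layer;
    print {$fh} $text;
    close $fh;
    return ($buffer, \@warnings);
}

my $wide = "\x{010d}";   # U+010D cannot be downgraded to Latin-1

my ($raw_bytes,  $raw_warn)  = print_and_capture($wide);            # no layer
my ($utf8_bytes, $utf8_warn) = print_and_capture($wide, ':utf8');   # utf8 layer

printf "no layer: %d warning(s), %d byte(s)\n", scalar @$raw_warn,  length $raw_bytes;
printf ":utf8:    %d warning(s), %d byte(s)\n", scalar @$utf8_warn, length $utf8_bytes;
```

For STDOUT itself, the equivalent of the second call would be `binmode STDOUT, ':utf8';` before printing.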

The utf8::upgrade() call has no effect, because the scalar starts off as
UTF8.  Prior to the introduction of the UTF8 flag, there was no way to put
the code point \x{010d} into a Perl string because Latin-1 can't represent it.
For backwards compatibility reasons, \x escapes below 256 have to be
represented as Latin-1.  Since you asked for \x{010d}, though, Perl knows
that the backwards compat rules don't apply and it can use a UTF8 scalar.
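
As a rough illustration of the two cases, the sketch below checks the UTF8 flag with utf8::is_utf8(). The variable names are made up for the example.

```perl
use strict;
use warnings;

my $latin1 = "\x{e9}";     # code point below 256: stored as one Latin-1 byte
my $wide   = "\x{010d}";   # code point above 255: stored as a UTF8 scalar

my $latin1_flag = utf8::is_utf8($latin1) ? 1 : 0;
my $wide_before = utf8::is_utf8($wide)   ? 1 : 0;

utf8::upgrade($wide);      # no effect here, the scalar already is UTF8
my $wide_after = utf8::is_utf8($wide) ? 1 : 0;

print "latin1 flag: $latin1_flag\n";
print "wide flag before/after upgrade: $wide_before/$wide_after\n";
```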

HTH,

Marvin Humphrey



Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread Marvin Humphrey
On Wed, Jun 16, 2010 at 05:34:44PM -0700, David E. Wheeler wrote:

> So the UTF8 flag is enabled, and yet it has "\303\204\302\215" in it. What is 
> that crap?

That's octal notation, which I think Dump() uses for any byte greater than 127
and for control characters, so that it can output pure ASCII.  

That sequence is only four bytes: 
  
  mar...@smokey:~ $ perl -MEncode -MDevel::Peek -e '$s = "\303\204\302\215"; Encode::_utf8_on($s); Dump $s'
  SV = PV(0x801038) at 0x80e880
    REFCNT = 1
    FLAGS = (POK,pPOK,UTF8)
    PV = 0x2012f0 "\303\204\302\215"\0 [UTF8 "\x{c4}\x{8d}"]
    CUR = 4   <--- four bytes
    LEN = 8
  mar...@smokey:~ $
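
To make the octal notation less opaque, here is a small sketch (variable names are illustrative only) showing that the four octal escapes are the same bytes as their hex spelling, and that reading them as UTF-8 yields the two code points Dump() reports in its second quoted string.

```perl
use strict;
use warnings;
use Encode ();

my $bytes = "\303\204\302\215";    # octal escapes, as Dump() prints them
my $same  = "\xc3\x84\xc2\x8d";    # the same four bytes written in hex

my $chars = Encode::decode('UTF-8', $bytes);   # interpret the bytes as UTF-8

printf "bytes: %d, equal to hex form: %s\n",
    length $bytes, ($bytes eq $same ? 'yes' : 'no');
printf "characters: %d (%s)\n",
    length $chars, join ' ', map { sprintf 'U+%04X', ord } split //, $chars;
```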

The logical content of the string follows in the second quote:

>  [UTF8 "Tomas Laurinavi\x{c4}\x{8d}ius"]

That's valid UTF-8.

> my $str = 'Tomas Laurinavičius';

In source code, I try to stick to pure ASCII and use \x escapes -- like Dump()
does.

  my $str = "Tomas Laurinavi\x{c4}\x{8d}ius";

However, because those code points are both representable as Latin-1, Perl
will create a Latin-1 string.  If you want to force its internal encoding to
UTF-8, you need to do additional work.

  mar...@smokey:~ $ perl -MDevel::Peek -e '$s = "\x{c4}"; Dump $s; utf8::upgrade($s); Dump $s'
  SV = PV(0x801038) at 0x80e870
    REFCNT = 1
    FLAGS = (POK,pPOK)
    PV = 0x2012e0 "\304"\0
    CUR = 1
    LEN = 4
  SV = PV(0x801038) at 0x80e870
    REFCNT = 1
    FLAGS = (POK,pPOK,UTF8)
    PV = 0x2008f0 "\303\204"\0 [UTF8 "\x{c4}"]
    CUR = 2
    LEN = 3
  mar...@smokey:~ $
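
The same check can be scripted without reading Dump() output. The sketch below assumes bytes::length() is an acceptable way to peek at the internal buffer size. It shows that utf8::upgrade() changes the internal representation (CUR goes from 1 to 2) while the logical string stays the same.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length() without enabling the pragma

my $s    = "\x{c4}";
my $copy = $s;                            # untouched copy for comparison

my $internal_before = bytes::length($s);  # size of the Latin-1 buffer
utf8::upgrade($s);
my $internal_after  = bytes::length($s);  # size of the UTF-8 buffer

printf "internal bytes: %d -> %d\n", $internal_before, $internal_after;
printf "characters: %d, still equal to original: %s\n",
    length $s, ($s eq $copy ? 'yes' : 'no');
```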

> Confused and frustrated,

IMO, to get UTF-8 right consistently in a large Perl system, you need to
understand the internals and you need Devel::Peek at hand.  Perl tries to hide
the details, but there are too many ways for it to fail silently.  ("perl -C",
$YAML::Syck::ImplicitUnicode, etc.)

Marvin Humphrey



Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread Marvin Humphrey
On Wed, Jun 16, 2010 at 01:59:33PM -0700, David E. Wheeler wrote:
> I think what I need is some code to strip non-utf8 characters from a string
> -- even if that string has the utf8 bit switched on. I thought that Encode
> would do that for me, but in this case apparently not. Anyone got an
> example?

Try this:

Encode::_utf8_off($string);
$string = Encode::decode('utf8', $string);

That will replace any byte sequences which are invalid UTF-8 with the Unicode
replacement character.  
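
A sketch of that recipe applied to a deliberately broken input. `scrub_flagged` is a hypothetical wrapper name, and the broken string is built by forcing the UTF8 flag onto a lone continuation byte, which is only one of many ways such data can arise.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper wrapping the two steps above.
sub scrub_flagged {
    my ($string) = @_;
    Encode::_utf8_off($string);               # treat the internal buffer as raw bytes
    return Encode::decode('utf8', $string);   # invalid sequences become U+FFFD
}

# Build a broken input: a lone continuation byte with the UTF8 flag forced on.
my $broken = "abc\x80def";
Encode::_utf8_on($broken);

my $clean = scrub_flagged($broken);
printf "%s\n", join ' ', map { sprintf 'U+%04X', ord } split //, $clean;
```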

If you want to guarantee that the flag is on first, do this:

utf8::upgrade($string);
Encode::_utf8_off($string);
$string = Encode::decode('utf8', $string);
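
To see why the upgrade step matters, the sketch below (with the hypothetical helper name `scrub_any`) runs a valid Latin-1 string through the three-step version and, for comparison, through a bare decode that skips the upgrade.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper wrapping the three steps above.
sub scrub_any {
    my ($string) = @_;
    utf8::upgrade($string);                   # no-op if the flag is already on
    Encode::_utf8_off($string);
    return Encode::decode('utf8', $string);
}

my $latin1 = "caf\x{e9}";                     # valid Latin-1, UTF8 flag off

my $with_upgrade = scrub_any($latin1);

my $copy            = $latin1;                # decode a copy, leave $latin1 alone
my $without_upgrade = Encode::decode('utf8', $copy);

for my $result ($with_upgrade, $without_upgrade) {
    printf "%s\n", join ' ', map { sprintf 'U+%04X', ord } split //, $result;
}
```

With the upgrade, the Latin-1 character survives the round trip. Without it, the lone byte is not valid UTF-8 and is replaced.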

Devel::Peek's Dump() function will come in handy for checking results.

Cheers,

Marvin Humphrey



Re: Win32 *W functions and old -C behavior

2006-11-01 Thread Marvin Humphrey


On Oct 31, 2006, at 11:57 PM, Oleg V. Volkov wrote:

> Alas, I couldn't get to general discussion - link at end of -C discussion
> seems to be either mangled in a way I can't restore or broken.


http://www.mail-archive.com/perl-unicode@perl.org/msg01963.html

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: perlio probs

2005-09-23 Thread Marvin Humphrey


On Sep 22, 2005, at 10:28 PM, Martin Hosken wrote:
> Does anyone have any experience of a bug I've encountered in 5.8.7 only
> whereby occasionally (which is what makes it hard to report) this type of
> code:
>
> $fh = IO::File->new("< input.dat") || die;
> binmode $fh;
>    # lots of code, even to different package
>
> $fh->read($dat, $num_bytes);
>
> does utf-8 conversion on $dat !!


It's a shot in the dark, but is there any correlation between the  
screwup and the value of the utf8 flag on $dat prior to the read call?
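
One way to check for such a correlation is to record the flag state immediately before and after the read. This is only a diagnostic sketch: it reads from an in-memory handle rather than input.dat, `flag_state` is a hypothetical helper, and it shows the unproblematic baseline (a fresh buffer), not a reproduction of the bug.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether a scalar carries the UTF8 flag.
sub flag_state { return utf8::is_utf8($_[0]) ? 'on' : 'off' }

my $input = "\xC4\x8D\x00\xFF";    # stand-in for the contents of input.dat
open my $fh, '<', \$input or die "cannot open in-memory handle: $!";
binmode $fh;

my $dat    = '';                   # fresh buffer, never held character data
my $before = flag_state($dat);
my $got    = read($fh, $dat, 4);
my $after  = flag_state($dat);

print "flag before read: $before, after read: $after, bytes read: $got\n";
```

In the real program, logging flag_state($dat) right before the $fh->read call on both good and bad runs would show whether the two line up.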


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/