> Message: 19
> From: "Sisyphus" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Subject: Re: pound sign trouble
> Date: Tue, 8 Jul 2003 16:55:46 +1000
> 
> 
> ----- Original Message -----
> From: "Peter Guzis" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Tuesday, July 08, 2003 11:30 AM
> Subject: RE: pound sign trouble
> 
> 
> > My Windows 2000 box exhibits the same behavior.  I believe you are running
> > into an ancient limitation of the DOS shell.  DOS and its descendants have
> > never been particularly good at handling extended characters.  Back in the
> > day, if you tried to print a text file containing those characters to a
> > printer, the output was not what one would expect.
> >
> > perl -e "print ord(someextendedcharacter);" will yield wildly inaccurate
> > results.  However, if you place the same statement in a perl script file and
> > run it, everything works as intended.
> >
> >
> 
> It seems to be always a matter of converting between cp850 (DOS) and cp1252
> (Windows) codesets - which can be done with Text::Iconv or the Encode module
> (perl 5.8 only).
> 
> The annoying thing is that it's difficult to anticipate when such
> conversions are going to be necessary.
> 
> I would have expected that
> perl -e "print ord('£');"
> would produce '156', in which case no such conversion would be needed.
> 
> I expected that because if I run the following script and enter the £
> symbol at the prompt, it prints 156.
> 
> my $sym = <STDIN>;
> chomp($sym);
> print ord($sym);
> 
> I am no longer surprised when my expectations are incorrect :-)
> 
> Cheers,
> Rob
> 


What Rob said...  Basically, the ISO 8859 standards (the "Latin-x" encodings) reserve 
bytes 128-159 for control codes rather than printable characters.  Then, of course, 
Microsoft ignored that convention and put some "special" printable characters into 
most of those byte values in the windows-1252 codepage.  Such characters as the "left 
quote", "right quote", florin, ellipsis (looks like 3 periods, but is a single 
character), etc.
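
For anyone who wants to poke at this, here is a rough sketch using the Encode module 
(perl 5.8 or later) that Rob mentioned.  It decodes one byte from the reserved range 
under both encodings, then encodes the pound sign from this thread under cp850 and 
cp1252.  The variable names are just mine for illustration; treat it as a sketch, not 
the one true way to do it.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The same byte means different things under different encodings.
my $byte      = "\x85";
my $as_cp1252 = decode('cp1252',     $byte);   # printable ellipsis in windows-1252
my $as_latin1 = decode('iso-8859-1', $byte);   # a control code in Latin-1

printf "cp1252: U+%04X\n", ord($as_cp1252);    # cp1252: U+2026
printf "latin1: U+%04X\n", ord($as_latin1);    # latin1: U+0085

# The pound sign lands on a different byte in the DOS and Windows codepages.
my $pound       = "\x{A3}";                    # U+00A3 POUND SIGN
my $cp850_byte  = ord(encode('cp850',  $pound));
my $cp1252_byte = ord(encode('cp1252', $pound));

print "cp850:  $cp850_byte\n";                 # cp850:  156
print "cp1252: $cp1252_byte\n";                # cp1252: 163
```

The 156 is the value Rob's STDIN script reported, which fits the console handing the 
script cp850 bytes.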

Note that windows-1252 is NOT the same as Latin-1, Latin-2, or any other "Latin-x" ISO 
encoding.  All of the "Latin" ISO encodings leave that reserved range unused.  This is 
a huge headache when using XML, since an XML parser assumes that any document without 
an encoding declaration is encoded as utf-8 (or utf-16 with a byte order mark).  If you 
try to parse a file containing non-ASCII "Latin-x" or windows-1252 characters in which 
the XML encoding has not been declared, then the parser will croak (this is what it is 
supposed to do, by the way, for any newbies).  This is to encourage users to save all 
their data in utf-8 in the first place.  The basic thing is, most OS's do not support 
utf-8 directly in their shells at this point (correct me if I am wrong on this), and 
this is the problem you are seeing with the double character "glyph" junk: the two 
bytes that make up a single utf-8 character are displayed as two separate characters.
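
A small sketch of both effects, again assuming the Encode module from perl 5.8 or 
later (the variable names are made up for the example).  It shows the pound sign 
turning into two bytes in utf-8, what a cp850 console would make of those two bytes, 
and a strict utf-8 decode refusing a lone Latin-1 byte, which is roughly the situation 
an XML parser is in when the encoding has not been declared.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $pound = "\x{A3}";                          # U+00A3 POUND SIGN

# utf-8 stores the pound sign as two bytes.
my $utf8      = encode('UTF-8', $pound);
my $utf8_hex  = join ' ', map { sprintf '%02X', ord } split //, $utf8;
print "utf-8 bytes: $utf8_hex\n";              # utf-8 bytes: C2 A3

# A cp850 console treats each of those bytes as its own character.
my $seen      = decode('cp850', $utf8);
my $seen_hex  = join ' ', map { sprintf 'U+%04X', ord } split //, $seen;
print "seen as cp850: $seen_hex\n";            # seen as cp850: U+252C U+00FA

# A single Latin-1 byte is not valid utf-8, so a strict decode dies.
my $latin1    = encode('iso-8859-1', $pound);
my $decoded   = eval { decode('UTF-8', $latin1, Encode::FB_CROAK); 1 };
print $decoded ? "decoded\n" : "strict utf-8 decode failed\n";
```

U+252C is a box-drawing character and U+00FA is a lowercase u with an acute accent, so 
one pound sign shows up as two unrelated glyphs.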

Hope that makes everything clear as mud for y'all.

Joe

_______________________________________________
Perl-Win32-Users mailing list
[EMAIL PROTECTED]
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
