Re: About HTML unicode

Ben Morrow Fri, 03 Dec 2004 17:20:46 -0800

Quoth [EMAIL PROTECTED] (John Delacour):
> At 12:31 am +0800 3/12/04, He Zhiqiang wrote:
> 
> >Now i encountered another problem,  there are a few files contains 
> >not only one charset but also two or more, for example, file1 
> >contains japanese and chinese, if i use open() to  load the data 
> >into memory, ord and length etc.. can't correctly work! Perhaps i 
> >miss something to encode or decode the data ?
> >code:
> >#!/usr/bin/perl -w
> >use utf8;
> >open(FD, "< file1");
> >while(<FD>) {
> >chomp;
> >print "length = ".length($_);
> >}
> >close FD;
> >----------
> >length() can not count the correct non-ASCII characters. :(
> 
> If the file is in UTF-8, then it may be in any number of _languages_ 
> but it uses only one character set -- Unicode.  So far as I know "use 
> utf8" is now redundant and ineffectual in Perl.


Both utf8.pm and encoding.pm alter the encoding Perl considers your
*source file* to be in. This is different from what utf8.pm did under
5.6.
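
To illustrate the point, here is a minimal sketch (not from the original
post; the variable names are made up for the example). It assumes Perl 5.8
or later with no PERLIO or -C setting in effect. The octets are written with
\x escapes so the result does not depend on how the script itself is saved,
and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;
use utf8;   # affects only literals and identifiers in this source file

# Nine octets: the UTF-8 encoding of U+65E5 U+672C U+8A9E.
my $octets = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

# Read through a handle with no decoding layer: the data stays octets,
# even though `use utf8` is in effect.
open my $raw, '<', \$octets or die "open: $!";
my $as_bytes = <$raw>;
close $raw;

# Read through a decoding layer: the data becomes characters.
open my $dec, '<:encoding(UTF-8)', \$octets or die "open: $!";
my $as_chars = <$dec>;
close $dec;

print "without layer: ", length($as_bytes), "\n";   # 9
print "with layer:    ", length($as_chars), "\n";   # 3
```

So length() on data read from a file depends on the layer on the handle,
not on whether the script says "use utf8".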

> You will get the 
> correct character count (6 characters rather than 18 bytes) by 
> opening the file handle as utf-8 as below.
> 
> no warnings;
> my $f = "/tmp/cjk.txt";
> my $text = "\x{56d8}\x{56d9}\x{56da}\x{56db}\x{56dc}\x{56dd}\n";
> open F, ">$f";

binmode F;

both for portability and in case of some environment setting (PERLIO,
the locale variables with 5.8.0 or -C) having set some other encoding on
the data.

> print F $text; # writes $text to $f as UTF-8

utf8::encode $text; # make sure $text is a sequence of octets, not
                    # characters
print F $text;
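
A small sketch of what utf8::encode does to the string, and of its inverse
utf8::decode, using the same six characters as above. The variable names are
only for illustration; both functions work in place, so the sketch encodes
copies rather than the original.

```perl
use strict;
use warnings;

my $text = "\x{56d8}\x{56d9}\x{56da}\x{56db}\x{56dc}\x{56dd}";
my $chars_before = length $text;          # 6 characters

my $octets = $text;                       # work on a copy
utf8::encode($octets);                    # in place: characters -> UTF-8 octets
my $octet_count = length $octets;         # 18 octets, 3 per character

my $round_trip = $octets;
utf8::decode($round_trip)                 # in place: octets -> characters
    or die "not well-formed UTF-8";
my $chars_after = length $round_trip;     # 6 characters again

print "$chars_before characters, $octet_count octets, ",
      "$chars_after characters\n";
```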

> close F;
> open F, "<:utf8",  $f;
> for (<F>) {
>    chomp;
>    print "$_  -  Length = " . length() . $/;
> }
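
As an alternative to encoding by hand, the same round trip can be sketched
with an explicit layer on both the write and the read side, so nothing
depends on environment defaults. This is only one way to do it: the
temporary file from File::Temp, the lexical filehandles and the variable
names are choices made for the example, not something from the code quoted
above.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Hypothetical scratch file, removed automatically at exit.
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close $tmp_fh;

my $text = "\x{56d8}\x{56d9}\x{56da}\x{56db}\x{56dc}\x{56dd}\n";

# Write characters; the layer turns them into UTF-8 octets on disk.
open my $out, '>:encoding(UTF-8)', $file or die "Cannot write $file: $!";
print {$out} $text;
close $out or die "Cannot close $file: $!";

# Read them back; the layer turns the octets into characters again.
open my $in, '<:encoding(UTF-8)', $file or die "Cannot read $file: $!";
my $line = <$in>;
close $in;
chomp $line;

my $chars     = length $line;                    # 6
my $octet_len = length encode('UTF-8', $line);   # 18

print "characters = $chars, UTF-8 octets = $octet_len\n";
```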

Ben

-- 
  Joy and Woe are woven fine,
  A Clothing for the Soul divine       William Blake
  Under every grief and pine          'Auguries of Innocence'
  Runs a joy with silken twine.                                [EMAIL PROTECTED]
