Re: unicode encoding question

Mark Leighton Wed, 02 Apr 2008 13:05:36 -0700

Hi Eric,

You really need to invoke the encoding IO layers on the open() call.  Once 
you've done that, input and output need nothing special to make them 
work.  The trick is determining what kind of file you are dealing with and 
which layers to specify.
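
As a minimal sketch of that idea, assuming the input is UTF-16LE with a 
leading BOM (the sample text, the temp file and the "Version" pattern are 
made up for illustration):

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Hypothetical sample data: create a small UTF-16LE file with a BOM.
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
close( $tmp );

open( my $out, '>:raw:encoding(UTF-16LE)', $path ) or die "open: $!";
print {$out} "\x{feff}Version 1.2.3\n";
close( $out );

# Reading through the same layer yields ordinary character strings,
# so a regular expression works with no further decoding.
open( my $in, '<:raw:encoding(UTF-16LE)', $path ) or die "open: $!";
my $line = <$in>;
close( $in );

$line =~ s/^\x{feff}//;    # drop the BOM character, if present
my ( $version ) = $line =~ /Version\s+([\d.]+)/;
print "version: $version\n";
```

The characters are no longer "padded" once the layer has decoded them, which 
is why the match succeeds here.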


These are the functions that I'm using regularly (on a Windows platform, since 
you didn't specify):


#################################################
#
# get_file_encoding
#
# Determine how a file is encoded and return an encoding string for
# correctly opening the file for reading.
#
#  eg.   my ( $encoding, $bom ) = get_file_encoding( $path );
#        open( my $fh, '<' . $encoding, $path ) or die;
#        skip_bom( $fh, $bom );
#

sub get_file_encoding {
     my $path = shift;
     my $encoding = '';
     my $bom = '';

     # Open in :raw mode so the Windows CRLF layer cannot alter the header bytes.
     if ( open( my $file, '<:raw', $path ) ) {
         my $header;
         if ( read( $file, $header, 4, 0 ) == 4 ) {
             # Compare the first four bytes as one big-endian 32-bit value.
             my $magic = unpack( 'N', $header );
             if ( ( $magic & 0xffffff00 ) == 0xefbbbf00 ) {
                 $encoding = ":encoding(utf8)";
                 $bom = pack( 'C3', 0xef, 0xbb, 0xbf );

             } elsif ( $magic == 0xfffe0000 ) {
                 $encoding = ":encoding(UTF-32LE)";
                 $bom = pack( 'C4', 0xff, 0xfe, 0x00, 0x00 );

             } elsif ( $magic == 0x0000feff ) {
                 # The UTF-32BE BOM is the byte sequence 00 00 FE FF.
                 $encoding = ":encoding(UTF-32BE)";
                 $bom = pack( 'C4', 0x00, 0x00, 0xfe, 0xff );

             } elsif ( ( $magic & 0xffff0000 ) == 0xfffe0000 ) {
                 $encoding = ":encoding(UTF-16LE)";
                 $bom = pack( 'C2', 0xff, 0xfe );

             } elsif ( ( $magic & 0xffff0000 ) == 0xfeff0000 ) {
                 $encoding = ":encoding(UTF-16BE)";
                 $bom = pack( 'C2', 0xfe, 0xff );
             }
         }

         close( $file );
     }

     return ( wantarray ? ( $encoding, $bom ) : $encoding );
}
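
The same detection can also be written as a table lookup on the leading 
bytes.  This is only a sketch: sniff_bom and @BOMS are hypothetical names, 
not part of the functions in this message.  Unlike a fixed four-byte read, it 
also copes with input shorter than four bytes.

```perl
use strict;
use warnings;

# BOM signatures, longest first, so that UTF-32LE (FF FE 00 00) is
# tried before UTF-16LE (FF FE).
my @BOMS = (
    [ "\x00\x00\xfe\xff", 'UTF-32BE' ],
    [ "\xff\xfe\x00\x00", 'UTF-32LE' ],
    [ "\xef\xbb\xbf",     'utf8'     ],
    [ "\xfe\xff",         'UTF-16BE' ],
    [ "\xff\xfe",         'UTF-16LE' ],
);

# sniff_bom (hypothetical helper): given the first bytes of a file,
# return ( encoding name, BOM bytes ), or two empty strings if
# no signature matches.
sub sniff_bom {
    my $head = shift;
    for my $entry ( @BOMS ) {
        my ( $sig, $name ) = @$entry;
        return ( $name, $sig ) if index( $head, $sig ) == 0;
    }
    return ( '', '' );
}

my ( $name, $bom ) = sniff_bom( "\xff\xfeH\x00i\x00" );
printf "%s, BOM is %d bytes\n", $name, length( $bom );
```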



#################################################
#
# set_file_encoding
#
# Generate encoding and bom strings for use when correctly
# opening UTF-8/Unicode files for writing.
#
#  eg.   my ( $encoding, $bom ) = set_file_encoding( 'UTF-16' );
#        open( my $fh, '>' . $encoding, $path ) or die;
#        write_bom( $fh, $bom );
#

sub set_file_encoding {
     my $codepage = shift;
     $codepage = 'utf8'        if ( uc($codepage) eq 'UTF-8' );
     $codepage = 'utf8'        if ( uc($codepage) eq 'UTF8' );
     $codepage = 'UTF-16LE'    if ( uc($codepage) eq 'UTF-16' );
     $codepage = 'UTF-16LE'    if ( uc($codepage) eq 'UTF16' );
     $codepage = 'UTF-16BE'    if ( uc($codepage) eq 'UTF16BE' );
     $codepage = 'UTF-32LE'    if ( uc($codepage) eq 'UTF-32' );
     $codepage = 'UTF-32LE'    if ( uc($codepage) eq 'UTF32' );
     $codepage = 'UTF-32BE'    if ( uc($codepage) eq 'UTF32BE' );
     $codepage = 'iso-8859-1'  if ( uc($codepage) eq 'ASCII' );
     $codepage = 'iso-8859-1'  if ( uc($codepage) eq 'ANSI' );

     my $encoding = sprintf( ':raw:encoding(%s):crlf:utf8', $codepage );

     my $bom = '';
     $bom = "\x{feff}"  unless ( $codepage eq 'iso-8859-1' );

     return ( wantarray ? ( $encoding, $bom ) : $encoding );
}
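
The BOM is returned as the single character U+FEFF rather than as raw bytes 
because the encoding layer on the handle turns it into the right byte 
sequence when it is printed.  A small sketch using the Encode module directly 
(bom_hex is a hypothetical helper, used here only to show the bytes):

```perl
use strict;
use warnings;
use Encode qw( encode );

# bom_hex (hypothetical helper): encode U+FEFF in the given encoding
# and return the resulting bytes as space-separated hex.
sub bom_hex {
    my $enc   = shift;
    my $bytes = encode( $enc, "\x{feff}" );
    return join( ' ', map { sprintf( '%02x', ord($_) ) } split( //, $bytes ) );
}

for my $enc ( 'UTF-8', 'UTF-16LE', 'UTF-16BE', 'UTF-32LE', 'UTF-32BE' ) {
    printf "%-8s %s\n", $enc, bom_hex( $enc );
}
```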



#################################################
#
# skip_bom
#
# Move the file pointer to start reading after any Byte-Order-Marker
# detected by get_file_encoding().  Call it immediately after open().
#

sub skip_bom {
     my ( $file_handle, $bom ) = @_;
     seek( $file_handle, length( $bom ), 1 );
}



#################################################
#
# write_bom
#
# Write a Byte-Order-Marker to the given file handle.
#

sub write_bom {
     my ( $file_handle, $bom ) = @_;
     print( $file_handle $bom );
}


Cheers,
Mark

-------- Original Message  --------
Subject: unicode encoding question
From: eric clark <[EMAIL PROTECTED]>
To: perl-win32-admin@listserv.ActiveState.com
Date: Wednesday, April 02, 2008 2:48:24 PM

> I have a file that I am attempting to parse that is in unicode.  Here is 
> the code I am using:
> 
> use Encode;
> 
> $enc = find_encoding("ascii");
> 
> open( OUTFP, ">output.txt" ) || die "Error opening output.txt: $!\n";
> open( VER, "Ver.htm" ) || die "Error opening Ver.htm: $!\n";
> 
> while( <VER> )
> {
>     ## Regular expression goes here ##
>     $line = $enc->encode( $_ );
>     print OUTFP "$line\n\n\n";
> 
> }
> 
> close( VER );
> close( OUTFP );
> 
> I've tried using every encoding installed with that module, both decode 
> and encode and the output is always the same.  Basically the file I am 
> reading is unicode, so all the characters are padded.  I want this to be 
> either decoded into a normal text file, or at least be able to use the 
> regular expression.  No matter what the expression always fails.
> 
> Any ideas?
> 
> Thanks,
>     Eric
> 
> "I'd take you seriously but to do so would be an affront to your 
> intelligence."
> -- William F. Buckley --
> 
> 
> 
> _______________________________________________
> Perl-Win32-Admin mailing list
> Perl-Win32-Admin@listserv.ActiveState.com
> To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
