Re: auto-detecting file encoding

Jerry Yang Sun, 18 Jun 2006 20:12:10 -0700

Hi,

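A UTF-8 file written by a Windows program often starts with a BOM, the bytes "EF BB BF". The BOM is optional, so its absence does not rule out UTF-8, but when one is present it identifies the encoding form: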

Bytes	Encoding Form
00 00 FE FF	UTF-32, big-endian
FF FE 00 00	UTF-32, little-endian
FE FF	UTF-16, big-endian
FF FE	UTF-16, little-endian
EF BB BF	UTF-8
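
A minimal sketch of checking the first bytes of a file against the table above. The helper names sniff_bom and sniff_file are made up for illustration. Note that FF FE 00 00 could in principle also be a UTF-16 little-endian BOM followed by a NUL character, so treat the UTF-32 answer as a guess.

```perl
use strict;
use warnings;

# BOM signatures from the table above. Order matters: the UTF-32
# little-endian BOM (FF FE 00 00) begins with the UTF-16 little-endian
# BOM (FF FE), so the four-byte signatures are checked first.
my @boms = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Hypothetical helper: returns (encoding name, BOM length) for a byte
# string, or an empty list when the string does not start with a BOM.
sub sniff_bom {
    my ($bytes) = @_;
    for my $entry (@boms) {
        my ($sig, $name) = @$entry;
        return ($name, length $sig)
            if substr($bytes, 0, length $sig) eq $sig;
    }
    return;
}

# Hypothetical helper: reads the first four bytes of a file in raw mode
# and passes them to sniff_bom().
sub sniff_file {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Unable to open $path: $!";
    my $head = '';
    read($fh, $head, 4);
    close($fh);
    return sniff_bom($head);
}
```

An empty return list only means "no BOM"; the file may still be UTF-8, windows-1252 or latin-1.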

On 6/17/06, [EMAIL PROTECTED] < [EMAIL PROTECTED]> wrote:

Hello:
   I need to process the text of thousands of files automatically, with simple regexp substitutions.  The problem I have is that, although all files are plaintext, they have been written with a variety of programs in Windows, so they employ diverse encodings.  For example, some are in 'utf-8', others in 'windows-1252', and some in 'latin-1'.

   I was in the process of whipping up a script to run through these encodings (using Encode::decode) to find the best one for each file, but I came to an unfortunate realization:  a single two-byte character in UTF-8 looks suspiciously like two single-byte ANSI (windows-1252) characters.
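
To make the ambiguity concrete, here is a small sketch (the choice of U+00E9 is just an example) that decodes the same two bytes both ways. Neither decode reports an error, which is why trying windows-1252 first can never rule out UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# U+00E9 (e with acute accent) encoded as UTF-8 is the two bytes C3 A9.
my $bytes = encode('UTF-8', "\x{E9}");

# Strict UTF-8 decoding recovers the single character.
# LEAVE_SRC stops decode() from modifying $bytes in place.
my $as_utf8 = decode('UTF-8', $bytes,
                     Encode::FB_CROAK() | Encode::LEAVE_SRC());

# windows-1252 accepts the same two bytes and yields two characters,
# U+00C3 and U+00A9, without reporting any error.
my $as_cp1252 = decode('cp1252', $bytes,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());

printf "bytes:           %s\n",
    join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
printf "as UTF-8:        %d character(s)\n", length $as_utf8;
printf "as windows-1252: %d character(s)\n", length $as_cp1252;
```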

This is what I have so far.  It seems to work fine, but I'm not sure how reliable it is:

### START OF CODE

use Encode;

open(TXT_FH, "< $insert_path") or die("Unable to open file $insert_path: $!");
binmode(TXT_FH);  # read raw bytes, with no newline translation on Windows

# Attempt to read the file and decode the characters using
# various encoding schemes.  This is to work around the
# mess of formats and characters in the insert files.
my @encodings = ('utf-8', 'windows-1252', 'iso-8859-1');
my $raw_data = '';
my $utf8_txt = '';
my $enc_idx  = 0;
while( my $bytes = read(TXT_FH, my $buffer, 16) )
    {
    # buffer may end in a partial character so we append
    $raw_data .= $buffer;
    DECODE: while (length $raw_data)  # length, so a lone "0" is not skipped
        {
        if ($enc_idx > $#encodings)
            {
            $utf8_txt .= Encode::decode('utf-8', $raw_data, Encode::FB_PERLQQ);
            $enc_idx = 0;
            last DECODE;
            }

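        # FB_QUIET makes decode() stop at the first byte it cannot
        # handle; $raw_data then holds only the undecoded remainder
        $utf8_txt .= Encode::decode($encodings[$enc_idx], $raw_data, Encode::FB_QUIET);
        # anything left over means this encoding failed, so try the next
        $enc_idx++ if (length $raw_data);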
        }
    }
close(TXT_FH);

### END OF CODE

Is there a better way to detect them?  I need to make sure to interpret the encoding correctly, because later on I need to generate XML files with correct UTF-8.
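
One common heuristic, offered here as a sketch and not as something proposed in this thread: windows-1252 text that contains non-ASCII characters is rarely valid UTF-8 by accident, so try a strict UTF-8 decode first and fall back to windows-1252 only when it fails. Reading the whole file at once also sidesteps characters split across 16-byte reads, which in the script above would, as far as I can tell, leave bytes in $raw_data and move on to the next encoding. The helper names decode_guess and read_text_file are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a byte string, preferring strict UTF-8 and
# falling back to windows-1252. Returns (text, name of encoding used).
sub decode_guess {
    my ($bytes) = @_;

    # Drop a UTF-8 BOM if present so it does not end up in the text.
    $bytes =~ s/\A\xEF\xBB\xBF//;

    # FB_CROAK makes decode() die on the first malformed sequence;
    # LEAVE_SRC keeps $bytes intact for the fallback below.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return ($text, 'UTF-8') if defined $text;

    # With the default check mode, any byte that cp1252 does not define
    # is replaced by U+FFFD instead of raising an error.
    return (decode('cp1252', $bytes), 'cp1252');
}

# Hypothetical helper: slurp a file as raw bytes and decode it.
sub read_text_file {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Unable to open $path: $!";
    my $bytes = do { local $/; <$fh> };
    close($fh);
    return decode_guess(defined $bytes ? $bytes : '');
}
```

Pure ASCII files come back labelled UTF-8, which is harmless since ASCII is a subset of both encodings. This cannot tell windows-1252 from latin-1; the two differ only in the range 80 to 9F, where latin-1 has control characters that plain text rarely uses.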

_______________________________________________
ActivePerl mailing list
ActivePerl@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
