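A UTF-8 file may start with a BOM (byte order mark), the bytes "EF BB BF". Windows programs often write one, but UTF-8 does not require it, so a missing BOM does not rule out UTF-8. The BOMs for the Unicode encoding forms are: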
Bytes | Encoding Form |
---|---|
00 00 FE FF | UTF-32, big-endian |
FF FE 00 00 | UTF-32, little-endian |
FE FF | UTF-16, big-endian |
FF FE | UTF-16, little-endian |
EF BB BF | UTF-8 |
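A check against the table above could look roughly like this. It is only a sketch: `detect_bom` and `sniff_file` are hypothetical helper names, and the encoding labels are just strings chosen for illustration. The longer marks are tested first because FF FE 00 00 (UTF-32, little-endian) begins with FF FE (UTF-16, little-endian).

```perl
use strict;
use warnings;

# Byte order marks from the table, longest first so that
# UTF-32LE (FF FE 00 00) is not mistaken for UTF-16LE (FF FE).
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Hypothetical helper: takes the raw leading bytes of a file and returns
# (encoding name, BOM length), or an empty list when no BOM is present.
sub detect_bom {
    my ($head) = @_;
    for my $entry (@BOMS) {
        my ($bom, $name) = @$entry;
        return ($name, length $bom) if index($head, $bom) == 0;
    }
    return;
}

# Hypothetical helper: read the first four bytes of a file in binary mode
# and pass them to detect_bom().
sub sniff_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Unable to open file $path: $!";
    binmode($fh);
    read($fh, my $head, 4);
    close($fh);
    return detect_bom(defined $head ? $head : '');
}
```

An empty return only means that no BOM was found. The file can still be UTF-8, so this check cannot be the whole answer.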
On 6/17/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
Hello:
I need to process the text of thousands of files automatically, with simple regexp substitutions. The problem I have is that, although all files are plaintext, they have been written with a variety of programs in Windows, so they employ diverse encodings. For example, some are in 'utf-8', others in 'windows-1252', and some in 'latin-1'.
I was in the process of whipping up a script to run through these encodings (using Encode::decode) to try to find the best one for each, but I came to an unfortunate realization: a single two-byte Unicode character in UTF-8 looks suspiciously like two single-byte ANSI (windows-1252) characters.
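To make the ambiguity concrete, here is a small illustration using "é" (U+00E9) as an arbitrary example character. Its two UTF-8 bytes decode without complaint under either encoding:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "é" (U+00E9) is two bytes in UTF-8: C3 A9.
my $bytes = encode('UTF-8', "\x{E9}");

# Correct reading: one character.
my $as_utf8 = decode('UTF-8', $bytes);

# Wrong reading: the same bytes are also valid windows-1252, where
# C3 is U+00C3 and A9 is U+00A9, so this yields two characters.
my $as_cp1252 = decode('cp1252', $bytes);

printf "UTF-8:        %d character(s)\n", length $as_utf8;
printf "windows-1252: %d character(s)\n", length $as_cp1252;
```

Neither call reports an error, so a successful windows-1252 decode says nothing about whether the file really is windows-1252.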
This is what I have so far. It seems to work fine, but I'm not sure how reliable it is:
### START OF CODE
use Encode;

open(TXT_FH, "< $insert_path") or die("Unable to open file $insert_path: $!");
binmode(TXT_FH);  # read raw bytes, with no CRLF translation on Windows

# Attempt to read the file and decode the characters using
# various encoding schemes. This is to work around the
# mess of formats and characters in the insert files.
my @encodings = ('utf-8', 'windows-1252', 'iso-8859-1');
my $raw_data = '';
my $utf8_txt = '';
my $enc_idx = 0;
while ( my $bytes = read(TXT_FH, my $buffer, 16) )
{
    # buffer may end in a partial character so we append
    $raw_data .= $buffer;
    # test the length, not the truth value: the string "0" is false in Perl
    DECODE: while (length $raw_data)
    {
        if ($enc_idx > $#encodings)
        {
            # every encoding failed: decode as UTF-8 and escape bad bytes as \xHH
            $utf8_txt .= Encode::decode('utf-8', $raw_data, Encode::FB_PERLQQ);
            # FB_PERLQQ leaves the source untouched, so clear it by hand
            $raw_data = '';
            $enc_idx = 0;
            last DECODE;
        }
        # FB_QUIET decodes as far as it can and leaves the
        # undecoded remainder in $raw_data
        $utf8_txt .= Encode::decode($encodings[$enc_idx], $raw_data, Encode::FB_QUIET);
        $enc_idx++ if (length $raw_data);
    }
}
close(TXT_FH);
### END OF CODE
Is there a better way to detect them? I need to make sure to interpret the encoding correctly, because later on I need to generate XML files with correct UTF-8.
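One common heuristic, sketched here under stated assumptions rather than offered as a definitive answer, is to decode the whole file as strict UTF-8 first and fall back to windows-1252 only when that fails. The reasoning is that text in a single-byte encoding with accented characters is very unlikely to be valid UTF-8 by accident, whereas UTF-8 bytes are almost always valid windows-1252. `decode_guess` is a hypothetical helper name, not part of Encode. The fallback treats 'latin-1' files as windows-1252 too, on the assumption that they contain no C1 control characters (the two encodings differ only in bytes 80 to 9F).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: takes the raw bytes of a whole file and returns
# (decoded text, name of the encoding that was used).
sub decode_guess {
    my ($raw) = @_;

    # Drop a UTF-8 BOM if present so it does not end up in the text.
    $raw =~ s/\A\xEF\xBB\xBF//;

    # Strict UTF-8 first. FB_CROAK dies on the first malformed sequence;
    # LEAVE_SRC stops decode() from modifying $raw, which the fallback needs.
    my $text = eval {
        decode('UTF-8', $raw, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return ($text, 'UTF-8') if defined $text;

    # Not valid UTF-8: assume windows-1252 (an assumption, see above).
    return (decode('cp1252', $raw), 'cp1252');
}
```

Decoding the file in one piece also avoids the case where a multi-byte character is split across two reads, which would make a perfectly good UTF-8 file look malformed at the chunk boundary. Pure ASCII files come back labelled UTF-8, which is harmless because the bytes are identical in all three encodings.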
_______________________________________________
ActivePerl mailing list
ActivePerl@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs