Re: UTF-16LE fails in substitution

Steve Larson Wed, 21 Sep 2005 02:08:02 -0700

Thanks David and Dan.  Comments inline.
"David Graff" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
>
> It might be worthwhile to investigate your UTF-16 input data file in hex
> before deciding what needs to be done to read it properly in Perl.
> Presumably, if you'll have lots of files of this flavor, they'll be
> consistent in relevant details, so you only need to check one at the
> outset, to understand what's really going on.  Does the file have line
> terminations like this:
>
>    0d 00 0a 00
>    <CR>  <LF>


I have been using a Hex viewer extensively to look at the input and the
output.
The input is (some do not have the BOM--most do):
FF FE 3C 00 71 00 75 00 65 00 72 00 79 00 44 00
<QueryD
65 00 66 00 69 00 6E 00 69 00 74 00 69 00 6F 00
efinitio
6E 00 20 00 78 00 6D 00 6C 00 6E 00 73 00 3A 00
n xmlns:
and so on...
I have only a few that have actual CRLF in them, and I had not been testing
with those so far.  I looked at one and the line terminations in it
are 0D 00 0A 00.  I had been assuming it should be 0D 0A (because it is a
single line ending and can fit in two bytes I guess :>) and had been
"shooting at the wrong target" here. Finding that 0D 00 0A 00 is correct is
very helpful.
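
A minimal sketch of why the terminator is four bytes: in UTF-16LE every character in this range takes two bytes, so CR and LF each carry a trailing 00. It uses only the core Encode module; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode a CR LF pair as UTF-16LE and dump the resulting bytes in hex.
my $bytes = encode("UTF-16LE", "\x0d\x0a");
my $hex   = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

print "$hex\n";   # 0D 00 0A 00
```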

>
> Also, if you are using Perl to write UTF-16 data to a file handle, you'll
> only get the BOM (and only your machine's _native_ byte order) when you
> specify the encoding as "UTF-16".  If you say "UTF-16LE", you override your
> machine's native byte order (if necessary), and you don't get a BOM unless
> you explicitly write it yourself.

I may be a little confused here still.  The help that I included in the
first post said that "UTF-16 itself can be used for in-memory computations,
but if storage or transfer is required either UTF-16BE (big-endian) or
UTF-16LE (little-endian) encodings must be chosen" and "if you read a BOM,
you will know the byte order, since if it was written on a big-endian
platform, you will read the bytes 0xFE 0xFF, but if it was written on a
little-endian platform, you will read the bytes 0xFF 0xFE".  That indicated
to me that the BOM was only written when UTF-16BE or UTF-16LE were
specified.  My testing aligns with what you stated, except that UTF-16 is
output in BE byte order.  The LE input was generated on a PC and my program
is running on a PC, so I expected LE byte order.  Was forcing \x{fffe} in the
substitution causing Perl to reverse the byte order?  If so, the output did
not look like what I wanted, which sent me chasing other options.
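
A small sketch of the behaviour observed in that testing, using the core Encode module. As documented for Encode::Unicode, encoding to plain "UTF-16" produces big-endian output with a BOM prepended, whatever the platform, while "UTF-16LE" produces little-endian output with no BOM. The helper name hex_bytes is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    return join ' ', map { sprintf '%02X', ord($_) } split //, $_[0];
}

my $utf16   = hex_bytes(encode("UTF-16",   "<"));  # BOM + big-endian
my $utf16le = hex_bytes(encode("UTF-16LE", "<"));  # little-endian, no BOM

print "UTF-16   : $utf16\n";    # FE FF 00 3C
print "UTF-16LE : $utf16le\n";  # 3C 00
```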

>
> As for line termination patterns on output, you probably need to control
> that separately, either by setting "$\" or using the ":crlf" IO-layer.
> (Are you trying to write platform-independent code, or are you just trying
> to cope with a specific platform?)

I don't understand :>).  I tried setting $\ to \x0d\x0a but it did not seem
to make a difference, so I used \x0d\x0a directly instead of \n, which
sometimes just gave \x0a.
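
A hedged sketch of how $\ behaves when the encoding is done by an IO layer rather than by calling encode() by hand. $\ is appended once at the end of each print, and it passes through the handle's layers like any other output; it does not rewrite a \n that is already inside the string. The temporary file and the sample string are assumptions made for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the demonstration (removed automatically at exit).
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
close $tmp_fh;

{
    # :raw turns off any CRLF translation; the encoding layer emits UTF-16LE.
    open my $out, '>:raw:encoding(UTF-16LE)', $tmp_name or die "open: $!";
    local $\ = "\x0d\x0a";   # appended after every print, then encoded
    print {$out} "<a/>";
    close $out or die "close: $!";
}

# Read the file back as raw bytes to see what was actually written.
open my $in, '<:raw', $tmp_name or die "open: $!";
my $bytes = do { local $/; <$in> };
close $in;

my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
print "$hex\n";   # 3C 00 61 00 2F 00 3E 00 0D 00 0A 00
```

When the bytes are produced with encode() and printed to a ":raw" handle instead, $\ would be written as the two plain bytes 0D 0A, which is not a valid UTF-16LE line ending.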

>
> As for the code you posted at the top of this thread, note that "\x{fffe}"
> is the code point for "no such character" -- i.e. it is the one code point
> that is specifically left undefined/unassigned/unused so that the BOM code
> point "\x{feff}" will always work the way it is supposed to.
>
> The "\x{HHHH}" notation in perl refers to code points, not 16-bit encodings
> of characters.  To write a correct BOM, you have to use "\x{feff}", no
> matter what your output encoding layer may be.

Ahh.  The info about code point vs. character is helpful as well.  That
alone could have saved me hours.
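
To make the code point versus encoding distinction concrete, here is a short sketch that encodes the single code point "\x{feff}" three ways with the core Encode module. The code point never changes; only the bytes do. The hex_bytes helper and the %bom_bytes hash are names invented for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    return join ' ', map { sprintf '%02X', ord($_) } split //, $_[0];
}

my $bom = "\x{feff}";   # one character, code point U+FEFF
my %bom_bytes;
for my $enc ('UTF-16LE', 'UTF-16BE', 'UTF-8') {
    $bom_bytes{$enc} = hex_bytes(encode($enc, $bom));
    print "$enc: $bom_bytes{$enc}\n";
}
# UTF-16LE: FF FE
# UTF-16BE: FE FF
# UTF-8: EF BB BF
```

So writing "\x{feff}" at the start of text that is then encoded as UTF-16LE yields the FF FE seen at the top of the input files.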

>
> There are other things I would suggest changing in the code you posted,
> like improving the way error conditions are handled, using "slurp"
> mode for reading the input data, and fixing the regex substitution, which
> looks pretty broken (BOM is wrong, captured strings are deleted rather
> than being included in the substitution string).

I am open to improvements in the code as I only deal with Perl maybe once a
year or so and would not consider myself an expert :>).  The reason I deal
with the errors with warn and the strange notation in the output is that
this is part of a large build system and the output is parsed by the build
system.  I cannot stop processing files when one fails.  So I create the
message on failure that the build system will surface and press on with the
rest of the files.
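
One possible way to keep the "warn and press on" behaviour while still using die inside the per-file work is to wrap each file in eval. This is only a sketch: set_version is a hypothetical stand-in for the real per-file logic, the file names are invented, and the message format simply copies the "versionfiles : warning :" prefix used in this build system.

```perl
use strict;
use warnings;

# Hypothetical per-file worker: dies with a short reason on failure.
sub set_version {
    my ($file) = @_;
    die "Unable to set version in file\n" if $file =~ /bad/;  # stand-in failure
    return 1;
}

# Collect warnings as well as printing them, so the result can be inspected.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0]; print STDERR $_[0] };

my @done;
for my $file (qw(a.xml bad.xml c.xml)) {
    if (eval { set_version($file); 1 }) {
        push @done, $file;
    }
    else {
        (my $reason = $@) =~ s/\s+\z//;   # trim the trailing newline
        warn "versionfiles : warning : $file--$reason\n";
    }
}

print "processed: @done\n";   # processed: a.xml c.xml
```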

The following code works for the files I have to test it with right now.
Again, I would be interested in learning how to do the same thing better.
# read file to be updated
open VERSIONEDFILE, "<$working_file_list[0]" or
    warn "versionfiles : warning : $working_file_list[0]--Unable to set version in file\n";
$filecontents = "";
@filecontents = <VERSIONEDFILE>;
$filecontents = join '', @filecontents; # pull array into a single string so it can be parsed
close VERSIONEDFILE;

# Test for ASCII or UNICODE
my $decoder = guess_encoding($filecontents);
ref($decoder) or die "Can't guess: $decoder"; # trap error this way
# print "Decoder=\"", ref($decoder), "\" for file $working_file_list[0] \n";



if (ref($decoder) eq "Encode::XS") { # appears to be ASCII
    print "\t\tFile $working_file_list[0] appears to be ASCII.\n";

    # write updates to temporary file
    open VERSIONEDFILE, ">:encoding(ascii)", "$folder//_temp_file" or
        warn "versionfiles : warning : $working_file_list[0]--Unable to set version in file\n";

    # place <!-- Build Version: $build_number --> after
    # <?  ?> delimited comments that are supposed to be at the top of the file
    $filecontents =~ s/((<\?.*?\?>)*)\n?/$1\n<!-- Build Version: $build_number -->\n/s or
        warn "versionfiles : warning : $working_file_list[0]--Unexpected format. Unable to parse to set version.\n";

    print VERSIONEDFILE $filecontents;
    close VERSIONEDFILE;
}

elsif (ref($decoder) eq "Encode::Unicode") {  # appears to be UNICODE
    print "\t\tFile $working_file_list[0] appears to be UNICODE.\n";

    open VERSIONEDFILE, "<:raw", "$working_file_list[0]" or
        warn "versionfiles : warning : $working_file_list[0]--Unable to set version in file : $!\n";

    # one of many ways to slurp file.
    read VERSIONEDFILE, my $buffer, -s "$working_file_list[0]" or
        warn "versionfiles : warning : $working_file_list[0]--Unable to set version in file because of possible Unrecognized BOM\n";
    close VERSIONEDFILE;

    if (eval {$filecontents = decode("UTF-16", $buffer)}) { # decode files with a BOM
        # place <!-- Build Version: $build_number --> after
        # <?  ?> delimited comments that are supposed to be at the top of the file
        # "\x{feff}" is the BOM code point for UTF-16 and "\x{fffe}" is the
        # code point for "no such character" -- i.e. it is the one code point
        # that is specifically left undefined/unassigned/unused so that the
        # BOM code point "\x{feff}" will always work the way it is supposed to.
        # \x0d\x0a needed as \n produces only \x0a in some cases
        $filecontents =~ s/((<\?.*?\?>)*)\n?/$1\x0d\x0a<!-- Build Version: $build_number -->\x0d\x0a/s or
            warn "versionfiles : warning : $working_file_list[0]--Unexpected format. Unable to parse to set version.\n";

        open VERSIONEDFILE, ">:raw", "$folder//_temp_file" or
            warn "versionfiles : warning : $working_file_list[0]--Unable to set version in file : $!\n";
        print VERSIONEDFILE encode("UTF-16", $filecontents);
    }

    else { # decode files without a BOM
        # LE required in case there is no BOM.
        $filecontents = decode("UTF-16LE", $buffer) or
            warn "versionfiles : warning : $working_file_list[0]--decode failed.\n";

        # place <!-- Build Version: $build_number --> after
        # <?  ?> delimited comments that are supposed to be at the top of the file
        # \x0d\x0a needed as \n produces only \x0a in some cases
        $filecontents =~ s/((<\?.*?\?>)*)\n?/$1\x0d\x0a<!-- Build Version: $build_number -->\x0d\x0a/s or
            warn "versionfiles : warning : $working_file_list[0]--Unexpected format. Unable to parse to set version.\n";

        open VERSIONEDFILE, ">:raw", "$folder//_temp_file" or
            warn "versionfiles : warning : $working_file_list[0]--Unable to set version in file : $!\n";
        print VERSIONEDFILE encode("UTF-16LE", $filecontents); # now be explicit on endianness
    }

            close VERSIONEDFILE;



}

else {  # appears to be neither ASCII nor UNICODE so abort
    print "versionfiles : warning : $working_file_list[0] appears to be neither ASCII nor UNICODE so abort--Unexpected format. Unable to parse to set version.\n";
};

# replace original file
system ("attrib -R \"$working_file_list[0]\"");
rename "$folder\\_temp_file", $working_file_list[0] or
    warn "versionfiles : warning : $working_file_list[0]--Unable to set version in file\n";
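
For comparison, a compact sketch of the same decode, substitute, encode round trip that keeps the file's original byte order and BOM instead of rewriting BOM-bearing files as big-endian. It is not a drop-in replacement for the code above: the three sub names, the sample document and the build number are all invented for illustration, and files without a BOM are assumed to be little-endian, as in the code above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode UTF-16 bytes: honour a BOM if present, otherwise assume little-endian.
# Returns the text, the encoding name used, and whether a BOM was seen.
sub decode_utf16 {
    my ($bytes) = @_;
    if ($bytes =~ /\A(\xFF\xFE|\xFE\xFF)/) {
        my $enc  = $1 eq "\xFF\xFE" ? 'UTF-16LE' : 'UTF-16BE';
        my $body = substr($bytes, 2);          # drop the two BOM bytes
        return (decode($enc, $body), $enc, 1);
    }
    return (decode('UTF-16LE', $bytes), 'UTF-16LE', 0);
}

# Re-encode with the same byte order, restoring the BOM if there was one.
sub encode_utf16 {
    my ($text, $enc, $had_bom) = @_;
    $text = "\x{feff}" . $text if $had_bom;
    return encode($enc, $text);
}

# Insert the build comment after any leading <? ... ?> declarations.
sub add_build_comment {
    my ($text, $build_number, $eol) = @_;
    my $comment = "<!-- Build Version: $build_number -->";
    $text =~ s{\A((?:<\?.*?\?>\s*)*)}{
        my $head = $1;
        $head .= $eol if length $head && $head !~ /\n\z/;
        "$head$comment$eol";
    }se;
    return $text;
}

# Usage with a made-up little-endian document that carries a BOM.
my $original = qq{<?xml version="1.0"?>\x0d\x0a<QueryDefinition/>};
my $input    = "\xFF\xFE" . encode('UTF-16LE', $original);

my ($text, $enc, $had_bom) = decode_utf16($input);
my $updated = add_build_comment($text, '1.2.3', "\x0d\x0a");
my $output  = encode_utf16($updated, $enc, $had_bom);

print "$enc, BOM kept: ", ($output =~ /\A\xFF\xFE/ ? "yes" : "no"), "\n";
```

Working on the decoded text means the regex sees ordinary characters, so the same substitution serves both the BOM and no-BOM cases.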


>
> David Graff
>
>
