A few months ago, I thought I had finally figured out all there was to know about character encodings to prevent hosing output and getting "Wide character in print" warnings.

However, I recently noticed that when I upload a file with non-ASCII characters in its name, a plain Encode::decode_utf8 on the name is not enough to write it into a file correctly. The strange thing is that if I upload and save the file itself, the file name is preserved. If I write just the name to a file with binmode :bytes and then cat that file to the terminal, it displays correctly. However, if I open that :bytes file in BBEdit set to render UTF-8 (no BOM), the characters don't display correctly. If I write the filename to a file with binmode :utf8, I think I get double-encoding, and nothing displays it correctly.
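
To make the double-encoding guess concrete, here is a minimal sketch of what each layer does to the same data. The sample string (the two UTF-8 bytes of "é") and the helper name written_bytes are made up for illustration; an in-memory filehandle stands in for test.txt.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical sample: "é" as the two UTF-8 bytes a browser would send.
my $bytes = "\xC3\xA9";

# Print $str to an in-memory file through $layer and return the
# bytes that actually got written, as a hex string.
sub written_bytes {
    my ($str, $layer) = @_;
    my $out = '';
    open(my $fh, '>', \$out) or die "open: $!";
    binmode($fh, $layer);
    print {$fh} $str;
    close($fh);
    return join ' ', map { sprintf '%02x', ord } split //, $out;
}

print written_bytes($bytes, ':bytes'), "\n";              # c3 a9 (bytes pass through)
print written_bytes($bytes, ':utf8'), "\n";               # c3 83 c2 a9 (each byte re-encoded)
print written_bytes(decode_utf8($bytes), ':utf8'), "\n";  # c3 a9 (decoded once, encoded once)
print written_bytes(decode_utf8($bytes), ':bytes'), "\n"; # e9 (downgraded to Latin-1)
```

The last line is worth noting: a decoded string whose characters all fit below 0x100 is written to a :bytes handle as Latin-1, not UTF-8, which may or may not be related to what BBEdit shows.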

If I use decode_utf8($filename, Encode::FB_CROAK) (or Encode::FB_WARN), the utf8 flag on the string gets set and is_utf8 returns true. Without a check argument, it doesn't. So it seems there are characters in the string that cause decode_utf8 to return early. Either way, the filename string is not valid UTF-8.
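
One way to check the raw value independently of the utf8 flag is to dump its bytes and attempt a strict decode on a copy. This is only a sketch: try_decode and hex_dump are hypothetical helper names, and the sample inputs are invented, not the actual uploaded filename.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Return the decoded string if $octets is well-formed UTF-8, undef otherwise.
sub try_decode {
    my ($octets) = @_;
    my $copy = $octets;   # FB_CROAK may modify its argument, so work on a copy
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
}

# Show each character (or byte) of a string as hex.
sub hex_dump {
    my ($str) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $str;
}

for my $sample ("\xC3\xA9", "\xE9") {   # UTF-8 "é" vs. Latin-1 "é"
    my $chars = try_decode($sample);
    printf "%-6s valid=%s flag=%s\n",
        hex_dump($sample),
        defined $chars ? 'yes' : 'no',
        defined $chars && is_utf8($chars) ? 'on' : 'off';
}
```

Running the same two helpers on the value returned by param('file'), before any decoding, would show exactly which bytes the browser sent.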

Other form fields will give me the expected utf-8 string when I call decode_utf8 (in warn/cat/terminal, http/filename output). So, I'm wondering what the difference could be... a CGI module issue? multipart/form-data POST issue?

I've attached a web script example below that will hopefully come through okay for anyone who's interested. Try a filename with some non-ASCII characters and see what happens.

Any insights would be appreciated.

Thanks
Andrew



#!/usr/bin/perl

use strict;
use utf8;
use Encode qw(is_utf8 decode_utf8);
use CGI qw(:cgi uploadInfo);
use IO::File;

binmode(*STDOUT, ":utf8");

print "Content-Type: text/html; charset=utf-8\n\n";
print <<HTML;
<html>
<head>
<title>test</title>
</head>
<body>
HTML

if($ENV{REQUEST_METHOD} eq 'POST') {

        my $buf;
        my $rfh = upload('file');
        my $f = param('file');
        my $n = $f;
                
        $f = decode_utf8($f);
        ($f) = $f =~ m/([^\/\\]+)$/;

        my $fh = IO::File->new;
        $fh->open('test.txt', '>') or die "can't open test.txt: $!";
        binmode($fh, ':bytes'); # :utf8 ?
        print $fh $f, "\n"; # just write the filename into this file
        $fh->close;
        
        # save the file itself with its original name (test only: the name is untrusted input)
        $fh->open($f, '>') or die "can't open $f: $!";
        binmode($fh, ':bytes');
        while(read($rfh, $buf, 1024)) {
                print $fh $buf;
        }
        $fh->close;
        
        print 'file name = ', $f, ' / ';
        $f =~ s/(.)/sprintf("0x%02x ", ord($1))/eg; # check char codes
        print $f, '<br /><br />';
        
        my $vals = uploadInfo(param('file'));
        for(keys %{$vals}) {
                print $_, ': ', decode_utf8($vals->{$_}), '<br />';
        }

        print '<br />text field t1 = ', decode_utf8(param('t1')), '<br />';
        print 'text field t2 = ', decode_utf8(param('t2')), '<br />';
}

print <<HTML;
<br />
<form name="test" method="post" action="test.cgi" enctype="multipart/form-data">


<input type="text" name="t1" value="å †és†" /><br />
<input type="text" name="t2" value="ånøthé® †és†" /><br />
<input type="file" name="file" /><br /><br />

<input type="submit" />

</form>
</body>
</html>
HTML


