Re: character encoding on file upload name
Andrew Mace wrote:

> I've noticed that the non-ASCII characters are getting split into their base code points. I thought diacritical marks were always combined with their preceding letter, if possible.

You're talking of file names, I suppose. I think you'll find that this is a function of the file system, which stores file names in decomposed form; for what reason, maybe someone else can tell you. It has nothing to do with the behaviour of Perl, and you will find (I think, because at the moment I am working in MacOS 9/WinNT) that it is impossible to create a file named été (decomposed) in addition to a file named été (composed) in the same location.

JD
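The composed/decomposed distinction described above can be sketched in a few lines of Perl. This is only an illustration, assuming the core Unicode::Normalize module is available; the sample strings and the `codepoints` helper are made up for the example and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# "ete" with acute accents, written with precomposed letters (U+00E9) ...
my $composed   = "\x{E9}t\x{E9}";
# ... and with a plain "e" followed by COMBINING ACUTE ACCENT (U+0301).
my $decomposed = "e\x{301}te\x{301}";

# Hypothetical helper: list the code points of a string.
sub codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split(//, $string);
}

print "composed:   ", codepoints($composed),   "\n";
print "decomposed: ", codepoints($decomposed), "\n";

# The two spellings differ code point by code point ...
print "equal as-is:     ", ($composed eq $decomposed ? "yes" : "no"), "\n";
# ... but compare equal once both are brought to the same normalization form.
print "equal after NFC: ", (NFC($composed) eq NFC($decomposed) ? "yes" : "no"), "\n";
```

If a file name comes back from the file system in decomposed form, passing it through `NFC` before comparing or displaying it is one possible way to get the combined letters back.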
Re: character encoding on file upload name
On Apr 7, 2005, at 3:04 PM, John Delacour wrote:

> You're talking of file names, I suppose. I think you'll find that this is a function of the file system which stores file names in decomposed form, for what reason maybe someone else can tell you.

So that the OS can quickly compare filenames in a case-independent fashion.

-Ken
character encoding on file upload name
A few months ago, I thought I had finally figured out all there was to know about character encodings to prevent hosing output and getting "wide char in print" errors. However, I recently noticed that if I attempt to upload a file with non-ASCII characters in the name, I can't call a simple Encode::decode_utf8 to write this name into a file.

The strange thing is that if I upload and save this file, it preserves the file name. If I write just the name to a file with binmode :bytes and then cat that file to terminal, it displays it correctly. However, if I open the :bytes file in BBEdit set to render UTF-8 (no BOM), the characters don't display correctly. If I write the filename to a file with binmode :utf8, I think I get double-encoding and nothing displays it correctly.

If I use decode_utf8($filename, Encode::FB_CROAK) (or Encode::FB_WARN), then the utf8 flag on the string gets set and is_utf8 returns true. Otherwise it doesn't. So, it seems there are characters in the string that are causing decode_utf8 to return early. Regardless, the filename string is not valid UTF-8.

Other form fields will give me the expected UTF-8 string when I call decode_utf8 (in warn/cat/terminal, http/filename output). So, I'm wondering what the difference could be... a CGI module issue? A multipart/form-data POST issue?

I've attached a web script example below that will hopefully come through okay for anyone who's interested. Try a filename with some non-ASCII characters and see what happens. Any insights would be appreciated.
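One way to narrow down the question above is to look at the raw octets of the name and ask Encode directly whether they form valid UTF-8. The following is a rough sketch, not the original script: `hex_dump` and `try_decode` are hypothetical helpers, and the two sample byte strings stand in for whatever the browser actually sends.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode_utf8 is_utf8);

# Hypothetical helper: show each byte of a string as two hex digits.
sub hex_dump {
    my ($string) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split(//, $string);
}

# Hypothetical helper: return the decoded string, or undef if the octets
# are not valid UTF-8. A CHECK value such as FB_CROAK may modify the
# source string, so a copy is decoded.
sub try_decode {
    my ($octets) = @_;
    my $copy  = $octets;
    my $chars = eval { decode_utf8($copy, Encode::FB_CROAK) };
    return $chars;
}

# Assumed sample inputs: the same name as UTF-8 octets and as Latin-1 octets.
my @samples = (
    [ 'UTF-8 octets'   => "\xC3\xA9t\xC3\xA9" ],
    [ 'Latin-1 octets' => "\xE9t\xE9" ],
);

for my $sample (@samples) {
    my ($label, $octets) = @$sample;
    my $chars = try_decode($octets);
    printf "%-15s %-20s %s\n",
        $label,
        hex_dump($octets),
        defined $chars
            ? 'valid UTF-8, ' . length($chars) . ' characters, flag '
                . (is_utf8($chars) ? 'on' : 'off')
            : 'not valid UTF-8';
}
```

Running the same checks on the value of param('file') before any decoding would show whether the upload name arrives as UTF-8 at all, or in some other encoding or normalization form.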
Thanks
Andrew

#!/usr/bin/perl
use strict;
use utf8;
use Encode qw(is_utf8 decode_utf8);
use CGI qw(:cgi uploadInfo);
use IO::File;

binmode(*STDOUT, ':utf8');
print "Content-Type: text/html; charset=utf-8\n\n";
print <<HTML;
<html>
<head>
<title>test</title>
</head>
<body>
HTML

if($ENV{REQUEST_METHOD} eq 'POST') {
    my $buf;
    my $rfh = upload('file');
    my $f = param('file');
    my $n = $f;
    $f = decode_utf8($f);
    ($f) = $f =~ m/([^\/\\]+)$/;

    my $fh = new IO::File;
    $fh->open('> test.txt');
    binmode($fh, ':bytes'); # :utf8 ?
    print $fh $f, "\n"; # just write the filename into this file
    $fh->close;

    $fh->open("> $f"); # save the file itself with its original name
    binmode($fh, ':bytes');
    while(read($rfh, $buf, 1024)) {
        print $fh $buf;
    }
    $fh->close;

    print 'file name = ', $f, ' / ';
    $f =~ s/./sprintf("0x%02x ", ord($&))/eg; # check char codes
    print $f, '<br /><br />';

    my $vals = uploadInfo(param('file'));
    for(keys %{$vals}) {
        print $_, ': ', decode_utf8($vals->{$_}), '<br />';
    }
    print '<br />text field t1 = ', decode_utf8(param('t1')), '<br />';
    print 'text field t2 = ', decode_utf8(param('t2')), '<br />';
}

print <<HTML;
<br />
<form name="test" method="post" action="test.cgi" enctype="multipart/form-data">
<input type="text" name="t1" value=" s" /><br />
<input type="text" name="t2" value="nth s" /><br />
<input type="file" name="file" /><br /><br />
<input type="submit" />
</form>
</body>
</html>
HTML
Re: character encoding on file upload name
At 12:08 pm -0400 6/4/05, Andrew Mace wrote:

> Any insights would be appreciated.

What happens if you comment out

#use utf8;
...
#binmode(*STDOUT, ':utf8');
...
#binmode($fh, ':bytes'); # :utf8 ?
...
#binmode($fh, ':bytes');

It seems to work then as you want: http://cgi.bd8.com/cgi-bin/test050406.cgi

JD
Re: character encoding on file upload name
On Apr 6, 2005, at 3:03 PM, John Delacour wrote:

> At 12:08 pm -0400 6/4/05, Andrew Mace wrote:
>
>> Any insights would be appreciated.
>
> What happens if you comment out
>
> #use utf8;
> ...
> #binmode(*STDOUT, ':utf8');
> ...
> #binmode($fh, ':bytes'); # :utf8 ?
> ...
> #binmode($fh, ':bytes');
>
> It seems to work then as you want: http://cgi.bd8.com/cgi-bin/test050406.cgi
>
> JD

Well, I get the same result as before: the file itself saves okay and the HTML page reports back okay, but when viewing in BBEdit (UTF-8, no BOM), the name of the file in test.txt doesn't render correctly. Other form fields with extended charset data do, though. When I paste into Mail, though, it looks fine, so I have no idea what's going on.

When I write the name to file with :utf8 and then try to read it back in as :utf8 to send back to the browser, it gets totally hosed, that is:

$filename = decode_utf8($filename);
$fh->open('> test.txt');
binmode($fh, ':utf8');
print $fh $filename;
$fh->close;

$fh->open('< test.txt');
binmode($fh, ':utf8');
$filename = <$fh>;
$fh->close;

# HTML stuff
print $filename, '<br />';

If the string's utf8 flag is enabled, perl won't try to reencode when I write to a :utf8 opened file, right? I just don't understand why this field is different from the other, non-file, form fields in multipart/form-data.

Shouldn't I be using :utf8? If I don't always use that layer, things can easily get corrupted, right? Double encoding and such, as I reopen and append, etc.? And since I have UTF-8 characters in my script, I should "use utf8;" to let perl know, and binmode(*STDOUT, ':utf8') so that I'm not lying when I say Content-Type: text/html; charset=utf-8?

Thanks
Andrew
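The double-encoding worry raised above can be reproduced in isolation. The sketch below is an illustration under stated assumptions rather than a diagnosis of the script: it writes to an in-memory filehandle instead of test.txt, and `write_via_utf8_layer` and `hex_dump` are hypothetical helpers introduced only for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical helper: print a string through a :utf8 layer into an
# in-memory file and return the octets that were actually written.
sub write_via_utf8_layer {
    my ($string) = @_;
    my $buffer = '';
    open(my $fh, '>:utf8', \$buffer) or die "open failed: $!";
    print {$fh} $string;
    close($fh);
    return $buffer;
}

# Hypothetical helper: show each byte of a string as two hex digits.
sub hex_dump {
    my ($string) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split(//, $string);
}

my $octets = "\xC3\xA9";            # UTF-8 octets for U+00E9, not yet decoded
my $chars  = decode_utf8($octets);  # one character, U+00E9

# A decoded character string is encoded exactly once by the layer.
print "decoded string:   ", hex_dump(write_via_utf8_layer($chars)),  "\n";  # c3 a9

# Undecoded octets are treated as two Latin-1 characters and encoded again.
print "undecoded octets: ", hex_dump(write_via_utf8_layer($octets)), "\n";  # c3 83 c2 a9
```

In this sketch the layer only double-encodes when it is handed undecoded octets, so if a decoded string still comes out wrong, the octets that went into decode_utf8 are worth inspecting first.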