Re: character encoding on file upload name

2005-04-07 Thread John Delacour
Andrew Mace wrote:
> I've noticed that the non-ASCII characters are getting split into their
> base code points. I thought diacritical marks were always combined with
> their preceding letter, if possible.
You're talking of file names, I suppose. I think you'll find that this 
is a function of the file system which stores file names in decomposed 
form, for what reason maybe someone else can tell you.  It is nothing to 
do with the behaviour of Perl, and you will find (I think, because I am 
at the moment working in MacOS 9/WinNT) that it is impossible to create 
a file named été (decomposed) in addition to a file named été (composed) 
in the same location.

JD
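
A minimal sketch of the composed/decomposed distinction discussed above, assuming the core Unicode::Normalize module is available. The variable names and the code_points helper are illustrative only and are not taken from the thread.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# "été" written with precomposed characters (U+00E9).
my $composed = "\x{e9}t\x{e9}";

# The same name in decomposed form: a plain "e" followed by
# U+0301 COMBINING ACUTE ACCENT.
my $decomposed = NFD($composed);

# Illustrative helper: list the code points of a string as U+XXXX labels.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
}

print code_points($composed),   "\n";
print code_points($decomposed), "\n";

# The two strings differ code point by code point, but they compare
# equal once both are normalized to the same form.
print "equal after NFC\n" if NFC($decomposed) eq $composed;
```

Normalizing an uploaded name with NFC before comparing or displaying it is one way to get the combined characters back, if that is what the application expects.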


Re: character encoding on file upload name

2005-04-07 Thread Ken Williams
On Apr 7, 2005, at 3:04 PM, John Delacour wrote:
> You're talking of file names, I suppose. I think you'll find that this
> is a function of the file system which stores file names in
> decomposed form, for what reason maybe someone else can tell you.
So that the OS can quickly compare filenames in a case-independent 
fashion.

 -Ken


character encoding on file upload name

2005-04-06 Thread Andrew Mace
A few months ago, I thought I had finally figured out all there was to 
know about character encodings to prevent hosing output and getting 
"Wide character in print" warnings.

However, I recently noticed that if I attempt to upload a file with 
non-ASCII characters in the name, I can't call a simple 
Encode::decode_utf8 to write this name into a file.  The strange thing 
is that if I upload and save this file, it preserves the file name.  If 
I write just the name to a file with binmode :bytes and then cat that 
file to terminal, it displays it correctly.  However, if I open the 
:bytes file in BBEdit set to render utf-8 no BOM, the characters don't 
display correctly.  If I write the filename to a file with binmode 
:utf8, I think I get double-encoding and nothing displays it correctly.

If I use decode_utf8($filename, Encode::FB_CROAK) (or Encode::FB_WARN), 
then the utf8 flag on the string gets set and is_utf8 returns true.  
Otherwise it doesn't.  So, it seems there are characters in the string 
that are causing decode_utf8 to return early.  Regardless, the filename 
string is not valid utf-8.
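
One way to check whether a byte string really is well-formed UTF-8 is to decode it strictly inside an eval, as in the rough sketch below. It assumes the strict 'UTF-8' encoding name rather than decode_utf8, and strict_decode_utf8 is a hypothetical helper name. LEAVE_SRC is passed because decode() with a CHECK value otherwise modifies its input in place.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded character string, or undef
# if $bytes is not well-formed UTF-8.
sub strict_decode_utf8 {
    my ($bytes) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    my $chars = eval { decode('UTF-8', $bytes, $check) };
    return $chars;
}

my $good = "\xc3\xa9t\xc3\xa9";   # UTF-8 bytes for "été"
my $bad  = "\xe9t\xe9";           # Latin-1 bytes, not valid UTF-8

for my $name ($good, $bad) {
    my $chars = strict_decode_utf8($name);
    print defined $chars
        ? "valid UTF-8, " . length($chars) . " characters\n"
        : "not valid UTF-8\n";
}
```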

Other form fields will give me the expected utf-8 string when I call 
decode_utf8 (in warn/cat/terminal, http/filename output).  So, I'm 
wondering what the difference could be... a CGI module issue? 
multipart/form-data POST issue?

I've attached a web script example below that will hopefully come 
through okay for anyone who's interested.  Try a filename with some 
non-ASCII characters and see what happens.

Any insights would be appreciated.
Thanks
Andrew

#!/usr/bin/perl
use strict;
use utf8;
use Encode qw(is_utf8 decode_utf8);
use CGI qw(:cgi uploadInfo);
use IO::File;
binmode(*STDOUT, ":utf8");
print "Content-Type: text/html; charset=utf-8\n\n";
print <<HTML;
<html>
<head>
<title>test</title>
</head>
<body>
HTML
if($ENV{REQUEST_METHOD} eq 'POST') {
    my $buf;
    my $rfh = upload('file');
    my $f = param('file');
    my $n = $f;

    $f = decode_utf8($f);
    ($f) = $f =~ m/([^\/\\]+)$/;
    my $fh = new IO::File;
    $fh->open('> test.txt');
    binmode($fh, ':bytes'); # :utf8 ?
    print $fh $f, "\n"; # just write the filename into this file
    $fh->close;

    $fh->open("> $f"); # save the file itself with its original name
    binmode($fh, ':bytes');
    while(read($rfh, $buf, 1024)) {
        print $fh $buf;
    }
    $fh->close;

    print 'file name = ', $f, ' / ';
    $f =~ s/./sprintf("0x%02x ", ord($&))/eg; # check char codes
    print $f, '<br /><br />';

    my $vals = uploadInfo(param('file'));
    for(keys %{$vals}) {
        print $_, ': ', decode_utf8($vals->{$_}), '<br />';
    }
    print '<br />text field t1 = ', decode_utf8(param('t1')), '<br />';
    print 'text field t2 = ', decode_utf8(param('t2')), '<br />';
}
print <<HTML;
<br />
<form name="test" method="post" action="test.cgi" 
enctype="multipart/form-data">

<input type="text" name="t1" value=" s" /><br />
<input type="text" name="t2" value="nth s" /><br />
<input type="file" name="file" /><br /><br />
<input type="submit" />
</form>
</body>
</html>
HTML


Re: character encoding on file upload name

2005-04-06 Thread John Delacour
At 12:08 pm -0400 6/4/05, Andrew Mace wrote:
> Any insights would be appreciated.
What happens if you comment out
#use utf8;
...
#binmode(*STDOUT, ":utf8");
...
#binmode($fh, ':bytes'); # :utf8 ?
...
#binmode($fh, ':bytes');
It seems to work then as you want:
 http://cgi.bd8.com/cgi-bin/test050406.cgi
JD


Re: character encoding on file upload name

2005-04-06 Thread Andrew Mace
On Apr 6, 2005, at 3:03 PM, John Delacour wrote:
> At 12:08 pm -0400 6/4/05, Andrew Mace wrote:
>> Any insights would be appreciated.
> What happens if you comment out
> #use utf8;
> ...
> #binmode(*STDOUT, ":utf8");
> ...
> #binmode($fh, ':bytes'); # :utf8 ?
> ...
> #binmode($fh, ':bytes');
> It seems to work then as you want:
>  http://cgi.bd8.com/cgi-bin/test050406.cgi
> JD

Well, I get the same result as before - file itself saves okay, HTML 
page reports back okay, but when viewing in BBEdit (UTF-8, no BOM), the 
name of the file in test.txt doesn't render correctly - other form 
fields with extended charset data do, though.  When I paste into Mail, 
though, it looks fine, so I have no idea what's going on.  When I write 
the name to file with :utf8 and then try to read back in as :utf8 to 
send back to the browser, it gets totally hosed, that is:

$filename = decode_utf8($filename);

$fh->open('> test.txt');
binmode($fh, ':utf8');
print $fh $filename;
$fh->close;

$fh->open('< test.txt');
binmode($fh, ':utf8');
$filename = <$fh>;
$fh->close;
# HTML stuff
print $filename, '<br />';

If the string's utf8 flag is enabled, perl won't try to reencode when I 
write to a :utf8 opened file, right?  I just don't understand why this 
field is different from the other, non-file, form fields in 
multipart/form-data.

Shouldn't I be using :utf8?  If I don't always use that layer, things 
can easily get corrupted, right?  Double encoding, and such as I reopen 
and append, etc.?  And since I have UTF-8 characters in my script, I 
should "use utf8;" to let perl know and binmode(*STDOUT, ':utf8') so 
that I'm not lying when I say "Content-Type: text/html; charset=utf-8"?

Thanks
Andrew
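
For reference, the double-encoding effect asked about above can be reproduced without CGI at all. The sketch below is only an illustration under stated assumptions: it writes to an in-memory filehandle instead of test.txt, and bytes_written_via_utf8_layer is a hypothetical helper name. It compares what a :utf8 layer emits for an undecoded byte string and for the same data after decoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: print $string through a :utf8 layer and return
# the bytes that were actually written.
sub bytes_written_via_utf8_layer {
    my ($string) = @_;
    my $out = '';
    open my $fh, '>:utf8', \$out or die "cannot open in-memory file: $!";
    print {$fh} $string;
    close $fh;
    return $out;
}

my $raw     = "\xc3\xa9";              # UTF-8 bytes for "é", not yet decoded
my $decoded = decode('UTF-8', $raw);   # one character, U+00E9

# Undecoded bytes are treated as two separate characters and each one
# is encoded again: this is the double encoding.
printf "raw:     %s\n", unpack 'H*', bytes_written_via_utf8_layer($raw);

# The decoded string is encoded exactly once.
printf "decoded: %s\n", unpack 'H*', bytes_written_via_utf8_layer($decoded);
```

The first line should show four bytes (c383c2a9) and the second two (c3a9), which suggests that what matters is whether a string was decoded exactly once before it reaches a :utf8 handle.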