character encoding on file upload name

2005-04-06 Thread Andrew Mace
A few months ago, I thought I had finally figured out all there was to 
know about character encodings to prevent hosing output and getting 
"Wide character in print" warnings.

However, I recently noticed that if I attempt to upload a file with 
non-ASCII characters in the name, I can't call a simple 
Encode::decode_utf8 to write this name into a file.  The strange thing 
is that if I upload and save this file, it preserves the file name.  If 
I write just the name to a file with binmode :bytes and then cat that 
file to terminal, it displays it correctly.  However, if I open the 
:bytes file in BBEdit set to render utf-8 no BOM, the characters don't 
display correctly.  If I write the filename to a file with binmode 
:utf8, I think I get double-encoding and nothing displays it correctly.
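
The double-encoding suspected above can be reproduced in isolation. This is a minimal sketch, not taken from the original script: it assumes the name arrives as raw UTF-8 bytes, and hex_bytes is a hypothetical helper added only for display.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes { return join ' ', unpack('(H2)*', $_[0]) }

# Raw UTF-8 bytes for U+00E9, as they might arrive from a form field.
my $bytes = "\xC3\xA9";

# Encoding the undecoded bytes treats each byte as a Latin-1 character,
# which is the effect of printing a byte string through a :utf8 layer.
my $double = encode_utf8($bytes);

# Decoding first and then encoding gives back the original two bytes.
my $clean = encode_utf8(decode_utf8($bytes));

print 'as received:    ', hex_bytes($bytes),  "\n";   # c3 a9
print 'double-encoded: ', hex_bytes($double), "\n";   # c3 83 c2 a9
print 'decode, encode: ', hex_bytes($clean),  "\n";   # c3 a9
```

If the output file shows four bytes where two were expected, the string was probably never decoded before it reached the :utf8 layer.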

If I use decode_utf8($filename, Encode::FB_CROAK) (or Encode::FB_WARN), 
then the utf8 flag on the string gets set and is_utf8 returns true.  
Otherwise it doesn't.  So, it seems there are characters in the string 
that are causing decode_utf8 to return early.  Regardless, the filename 
string is not valid utf-8.
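
Whether a byte string is well-formed UTF-8 can be checked directly, rather than inferred from the internal utf8 flag. A minimal sketch, assuming the Encode module; is_valid_utf8 is a hypothetical helper name, not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if $bytes decodes as strict UTF-8.
# FB_CROAK makes decode() die on malformed input; LEAVE_SRC stops it
# from modifying the source string in place.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

my $composed   = "\xC3\xA9";    # U+00E9 as UTF-8
my $decomposed = "e\xCC\x81";   # U+0065 U+0301 as UTF-8
my $latin1     = "\xE9";        # a lone Latin-1 byte, not UTF-8

print 'composed:   ', is_valid_utf8($composed),   "\n";   # 1
print 'decomposed: ', is_valid_utf8($decomposed), "\n";   # 1
print 'latin-1:    ', is_valid_utf8($latin1),     "\n";   # 0
```

Both the composed and the decomposed forms pass, so a name that renders oddly is not necessarily malformed.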

Other form fields will give me the expected utf-8 string when I call 
decode_utf8 (in warn/cat/terminal, http/filename output).  So, I'm 
wondering what the difference could be... a CGI module issue? 
multipart/form-data POST issue?

I've attached a web script example below that will hopefully come 
through okay for anyone who's interested.  Try a filename with some 
non-ASCII characters and see what happens.

Any insights would be appreciated.
Thanks
Andrew

#!/usr/bin/perl
use strict;
use utf8;
use Encode qw(is_utf8 decode_utf8);
use CGI qw(:cgi uploadInfo);
use IO::File;
binmode(*STDOUT, ":utf8");
print "Content-Type: text/html; charset=utf-8\n\n";
print <<HTML;
<html>
<head><title>test</title></head>
<body>
HTML
if($ENV{REQUEST_METHOD} eq 'POST') {
my $buf;
my $rfh = upload('file');
my $f = param('file');
my $n = $f;

$f = decode_utf8($f);
($f) = $f =~ m/([^\/\\]+)$/;
my $fh = new IO::File;
$fh->open('> test.txt');
binmode($fh, ':bytes'); # :utf8 ?
print $fh $f, "\n"; # just write the filename into this file
$fh->close;

$fh->open("> $f"); # save the file itself with its original name
binmode($fh, ':bytes');
while(read($rfh, $buf, 1024)) {
print $fh $buf;
}
$fh->close;

print 'file name = ', $f, ' / ';
$f =~ s/./sprintf("0x%02x ", ord($&))/eg; # check char codes
print $f, '<br>';

my $vals = uploadInfo(param('file'));
for(keys %{$vals}) {
print $_, ': ', decode_utf8($vals->{$_}), '<br>';
}
print 'text field t1 = ', decode_utf8(param('t1')), '<br>';
print 'text field t2 = ', decode_utf8(param('t2')), '<br>';
}
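print <<HTML;
<form method="post" enctype="multipart/form-data">
<input type="file" name="file">
<input type="text" name="t1">
<input type="text" name="t2">
<input type="submit">
</form>
</body>
</html>
HTML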


Re: character encoding on file upload name

2005-04-06 Thread John Delacour
At 12:08 pm -0400 6/4/05, Andrew Mace wrote:
> Any insights would be appreciated.

What happens if you comment out
#use utf8;
...
#binmode(*STDOUT, ":utf8");
...
#binmode($fh, ':bytes'); # :utf8 ?
...
#binmode($fh, ':bytes');
It seems to work then as you want:
 
JD


Re: character encoding on file upload name

2005-04-06 Thread Andrew Mace
On Apr 6, 2005, at 3:03 PM, John Delacour wrote:
> At 12:08 pm -0400 6/4/05, Andrew Mace wrote:
> > Any insights would be appreciated.
>
> What happens if you comment out
> #use utf8;
> ...
> #binmode(*STDOUT, ":utf8");
> ...
> #binmode($fh, ':bytes'); # :utf8 ?
> ...
> #binmode($fh, ':bytes');
> It seems to work then as you want:
>
> JD

Well, I get the same result as before - file itself saves okay, HTML 
page reports back okay, but when viewing in BBEdit (UTF-8, no BOM), the 
name of the file in test.txt doesn't render correctly - other form 
fields with extended charset data do, though.  When I paste into Mail, 
though, it looks fine, so I have no idea what's going on.  When I write 
the name to file with :utf8 and then try to read back in as :utf8 to 
send back to the browser, it gets totally hosed, that is:

$filename = decode_utf8($filename);

$fh->open('> test.txt');
binmode($fh, ':utf8');
print $fh $filename;
$fh->close;

$fh->open('< test.txt');
binmode($fh, ':utf8');
$filename = <$fh>;
$fh->close;
# HTML stuff
print $filename, '';
If the string's utf8 flag is enabled, perl won't try to reencode when I 
write to a :utf8 opened file, right?  I just don't understand why this 
field is different from the other, non-file, form fields in 
multipart/form-data.

Shouldn't I be using :utf8?  If I don't always use that layer, things 
can easily get corrupted, right?  Double encoding, and such as I reopen 
and append, etc.?  And since I have UTF-8 characters in my script, I 
should "use utf8;" to let perl know and binmode(*STDOUT,':utf8') so 
that I'm not lying when I say Content-Type: text/html; charset=utf-8?

Thanks
Andrew
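
The write-then-read-back question above can be tested without CGI. A minimal sketch, assuming the name has already been decoded to characters; it uses an in-memory filehandle in place of test.txt, and the sample name is an assumption for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# A decomposed name ("e" + U+0301, "t", "e" + U+0301), decoded to characters.
my $name = decode_utf8("e\xCC\x81te\xCC\x81.txt");

# Write through an encoding layer: characters in, UTF-8 bytes out.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open for write: $!";
print {$out} $name, "\n";
close $out or die "close: $!";

# Read back through the same layer: UTF-8 bytes in, characters out.
open my $in, '<:encoding(UTF-8)', \$buffer or die "open for read: $!";
my $read_back = <$in>;
close $in;
chomp $read_back;

printf "characters written: %d\n", length $name;     # 9
printf "bytes stored:       %d\n", length $buffer;   # 12
print  "round trip intact:  ", ($read_back eq $name ? 'yes' : 'no'), "\n";
```

When the layers match on both sides and the string is decoded exactly once, the round trip preserves the name, including its decomposed form.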



Re: character encoding on file upload name

2005-04-06 Thread Joel Rees
Just a random thought: have you tried running hexdump on the file, on the 
upload/download using the system and the browser(s) and whatever else 
you use, and on the output of the script? hexdump should tell you whether 
the bytes are changing or not.

(There may be a bug in hexdump related to the BOM. Red Hat Linux's hexdump 
has such a bug, and I haven't checked other systems for it yet. Where 
present, the bug shifts the displayed bytes relative to the interpreted 
bytes in some formats. I should check, but I may not get to it immediately. 
Anyway, hexdump should give some idea of what's going on with all the 
different systems trying to help out.)
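
The same byte-level inspection can be done from inside Perl when hexdump is not at hand. A minimal sketch; hex_bytes is a hypothetical helper, and the two sample strings are assumptions chosen to show how one visible name can have two different byte sequences.

```perl
use strict;
use warnings;

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes { return join ' ', unpack('(H2)*', $_[0]) }

# Two UTF-8 byte sequences that both display as "ete" with acute accents.
my $composed   = "\xC3\xA9t\xC3\xA9";       # U+00E9 t U+00E9
my $decomposed = "e\xCC\x81te\xCC\x81";     # e U+0301 t e U+0301

print 'composed:   ', hex_bytes($composed),   "\n";   # c3 a9 74 c3 a9
print 'decomposed: ', hex_bytes($decomposed), "\n";   # 65 cc 81 74 65 cc 81
```

Comparing dumps like these at each stage (browser, script input, saved file) shows where, if anywhere, the bytes change.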



Re: character encoding on file upload name

2005-04-07 Thread Andrew Mace
On Apr 6, 2005, at 6:00 PM, Joel Rees wrote:
> Just a random thought, have you tried hexdump on the file, the 
> upload/download using the system and the browser(s) and whatever else 
> you use, and the output of the script? hexdump should tell you whether 
> things are changing or not.
>
> (There may be a bug in hexdump relative to the bom. RH Linux's has 
> such a bug and I haven't checked other systems for it yet. The bug 
> shifts the displayed bytes relative to the interpreted bytes in some 
> formats, if it's there, I guess I should check, but it may not be 
> immediately. Anyway, hexdump should give some idea what's going on 
> with all the different systems trying to help out.)

I've noticed that the non-ASCII characters are getting split into their 
base code points.  For example, U+00E9, Latin small letter E with 
acute, becomes U+0065 U+0301 (unicode.org/charts/PDF/U0080.pdf).  Is 
there a way to easily recombine the code points to get the original 
value?  It's strange to me that Encode::decode_utf8 doesn't do this.  I 
thought diacritical marks were always combined with their preceding 
letter, if possible.

Andrew
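
The behaviour described above can be seen at the code point level. A minimal sketch: decode_utf8 only converts bytes to characters and performs no normalization, while NFC from Unicode::Normalize recombines the pair. code_points is a hypothetical helper for display.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use Unicode::Normalize qw(NFC);

# Hypothetical helper: list a character string as U+XXXX code points.
sub code_points {
    return join ' ', map { sprintf 'U+%04X', ord } split //, $_[0];
}

# Decomposed "e with acute" as it arrives in UTF-8 bytes.
my $decoded = decode_utf8("e\xCC\x81");

print 'after decode_utf8: ', code_points($decoded),      "\n";   # U+0065 U+0301
print 'after NFC:         ', code_points(NFC($decoded)), "\n";   # U+00E9
```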


Re: character encoding on file upload name

2005-04-07 Thread Andrew Mace
With Randy's tip and my discovery of the Unicode::Normalize module, 
I've gotten things worked out.

use Unicode::Normalize qw(compose);
use Encode qw(decode_utf8);
# ...
my $f = decode_utf8(param('file'));
# ... write out the file itself with name in decomposed utf-8
$f = compose($f);
# ... now do something with filename in composed utf-8
# etc.

Thanks to everyone who helped out.  I'm not sure what to do with my day 
now.

Andrew
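
A variation on the fix above, offered as a sketch rather than a correction: it uses NFC instead of compose, on the assumption that input may arrive in mixed or unordered form (NFC decomposes and reorders before composing, while compose alone does not). display_name is a hypothetical helper, and the sample path is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use Unicode::Normalize qw(NFC);

# Hypothetical helper: turn the raw bytes of an uploaded file name into a
# composed (NFC) character string, dropping any leading path.
sub display_name {
    my ($raw) = @_;
    my $name = decode_utf8($raw);
    ($name) = $name =~ m{([^/\\]+)\z};   # keep the last path component
    return unless defined $name;
    return NFC($name);
}

binmode STDOUT, ':encoding(UTF-8)';

# A decomposed name with a Windows-style path in front of it.
my $shown = display_name("C:\\docs\\e\xCC\x81te\xCC\x81.txt");
print $shown, "\n";
printf "%d characters\n", length $shown;   # 7
```

NFC is idempotent, so calling display_name on a name that is already composed returns it unchanged.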

On Apr 7, 2005, at 1:57 PM, Randy Boring wrote:
> > I've noticed that the non-ASCII characters are getting split into their
> > base code points.  For example, U+00E9, Latin small letter E with
> > acute, becomes U+0065 U+0301 (unicode.org/charts/PDF/U0080.pdf).  Is
> > there a way to easily recombine the code points to get the original
> > value?  It's strange to me that Encode::decode_utf8 doesn't do this.  I
> > thought diacritical marks were always combined with their preceding
> > letter, if possible.
> >
> > Andrew
>
> You've run into the particular format of HFS+ filenames.  It's not just
> any UTF-8 encoding: nearly all of the Unicode characters that are
> decomposable are decomposed, and must be so!
>
> In Apple's header files (CoreFoundation/CFStringEncodingExt.h), it's
> referred to as kUnicodeCanonicalDecompVariant.
> In NSString.h there are methods for
> decomposedStringWithCanonicalMapping (and precomposed- and
> -CompatibilityMapping).  How you get to them from Perl, though?  Maybe
> CamelBones?
>
> A description of this text encoding (and the reason for it) is found at
>   http://developer.apple.com/technotes/tn/tn1150.html
>
> see especially
>   http://developer.apple.com/technotes/tn/tn1150.html#HFSPlusNames
> and
>   http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties
>
> Hope that helps a little,
>  -Randy



Re: character encoding on file upload name

2005-04-07 Thread John Delacour
Andrew Mace wrote:
> I've noticed that the non-ASCII characters are getting split into their 
> base code points... I thought diacritical marks were always combined with
> their preceding letter, if possible.

You're talking of file names, I suppose. I think you'll find that this 
is a function of the file system which stores file names in "decomposed" 
form, for what reason maybe someone else can tell you.  It is nothing to 
do with the behaviour of Perl, and you will find (I think, because I am 
at the moment working in MacOS 9/WinNT) that it is impossible to create 
a file named été (decomposed) in addition to a file named été (composed) 
in the same location.

JD
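
Why the two spellings collide can be illustrated outside the file system. A minimal sketch using Unicode::Normalize; it only models the comparison and is an assumption about how a normalizing file system treats names, not a test against one.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

my $composed   = "\x{E9}t\x{E9}";       # U+00E9 t U+00E9
my $decomposed = "e\x{301}te\x{301}";   # e U+0301 t e U+0301

# As raw character strings the two names are different...
print 'raw compare:        ',
    ($composed eq $decomposed ? 'same' : 'differ'), "\n";            # differ

# ...but once both are put in decomposed form they are identical, so a
# file system that stores names decomposed sees a single name.
print 'normalized compare: ',
    (NFD($composed) eq NFD($decomposed) ? 'same' : 'differ'), "\n";  # same
```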


Re: character encoding on file upload name

2005-04-07 Thread Ken Williams
On Apr 7, 2005, at 3:04 PM, John Delacour wrote:
> You're talking of file names, I suppose. I think you'll find that this 
> is a function of the file system which stores file names in 
> "decomposed" form, for what reason maybe someone else can tell you.

So that the OS can quickly compare filenames in a case-independent 
fashion.

 -Ken
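
A rough illustration of the point about case-independent comparison. This is a sketch only: names_match is a hypothetical helper, and Perl's lc stands in for whatever case-folding table the file system actually uses, which may differ.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: compare two names ignoring case and ignoring
# whether accented letters are composed or decomposed.
sub names_match {
    my ($left, $right) = @_;
    return lc(NFD($left)) eq lc(NFD($right)) ? 1 : 0;
}

my $upper_composed   = "\x{C9}T\x{C9}";       # upper case, precomposed
my $lower_decomposed = "e\x{301}te\x{301}";   # lower case, decomposed

print names_match($upper_composed, $lower_decomposed), "\n";   # 1
print names_match($upper_composed, 'ete'), "\n";               # 0
```

With every name held in one normalized form, the comparison reduces to folding case and comparing code points in order.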


Re: character encoding on file upload name

2005-04-08 Thread David Cantrell
On Thu, Apr 07, 2005 at 08:51:14PM -0500, Ken Williams wrote:
> On Apr 7, 2005, at 3:04 PM, John Delacour wrote:
> >You're talking of file names, I suppose. I think you'll find that this 
> >is a function of the file system which stores file names in 
> >"decomposed" form, for what reason maybe someone else can tell you.
> So that the OS can quickly compare filenames in a case-independent 
> fashion.

What an odd notion!  Why would anyone want to do that? ;-)

It would also be useful for things like searching for filenames ignoring
all those silly accents that I can't type.

-- 
David Cantrell | Official London Perl Mongers Bad Influence

The Carthaginian Peace is not what Carthage did, but what was done unto
Carthage - total destruction and elimination of its power, the sale of
its people into slavery, salting the fields, burning the city to the
ground and saying rude things about it on Radio Romulus.
-- Alison Brooks
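
The accent-ignoring search mentioned above follows naturally from decomposed names. A minimal sketch, assuming Unicode::Normalize; strip_accents is a hypothetical helper, and it only affects letters that have a canonical decomposition (a letter such as U+00F8 has none and passes through unchanged).

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose, then drop the nonspacing combining marks.
sub strip_accents {
    my ($text) = @_;
    my $plain = NFD($text);
    $plain =~ s/\p{Mn}//g;
    return $plain;
}

# Sample names as character strings; the first and third carry accents.
my @names = ("\x{E9}t\x{E9}.txt", "notes.txt", "E\x{301}te\x{301}.doc");

# Find every name whose unaccented, lower-cased form starts with "ete".
my @hits = grep { lc(strip_accents($_)) =~ /^ete/ } @names;

print scalar(@hits), " match(es)\n";   # 2 match(es)
```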


Re: character encoding on file upload name

2005-04-08 Thread Randy Boring
>I've noticed that the non-ASCII characters are getting split into their 
>base code points.  For example, U+00E9, Latin small letter E with 
>acute, becomes U+0065 U+0301 (unicode.org/charts/PDF/U0080.pdf).  Is 
>there a way to easily recombine the code points to get the original 
>value?  It's strange to me that Encode::decode_utf8 doesn't do this.  I 
>thought diacritical marks were always combined with their preceding 
>letter, if possible.
>
>Andrew

You've run into the particular format of HFS+ filenames.  It's not just 
any UTF-8 encoding: nearly all of the Unicode characters that are 
decomposable are decomposed, and must be so!

In Apple's header files (CoreFoundation/CFStringEncodingExt.h), it's 
referred to as kUnicodeCanonicalDecompVariant.
In NSString.h there are methods for 
decomposedStringWithCanonicalMapping (and precomposed- and 
-CompatibilityMapping).  How you get to them from Perl, though?  Maybe 
CamelBones?

A description of this text encoding (and the reason for it) is found at
  http://developer.apple.com/technotes/tn/tn1150.html

see especially
  http://developer.apple.com/technotes/tn/tn1150.html#HFSPlusNames
and
  http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties


Hope that helps a little,

 -Randy