Very interesting read, and it opens a whole new can of worms: how should we
behave when we actually read file names from the filesystem?

As for the path literal, the newest revision of S32-setting-library should make
most people happy, as the default is OS-independent and abstract. More
strictness can be set with use flags or a more verbose syntax; this should also
make it easier to write portable programs in Perl 6. So far I'm quite happy
with the current result, way to go people :)

But what we should do when reading paths from the filesystem is still
a problem.

We can go the old Perl 5 way of treating filenames as binary by default and
then trying to convert them based on the local encoding settings.

But this just means any sane program will have to do an explicit decoding to a
Unicode path or string.

Like we do in Perl 5:

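use Encode qw(decode);

my $file = readdir $dir;
# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps $file intact.
my $decoded_file = eval {
  decode("utf8", $file, Encode::FB_CROAK | Encode::LEAVE_SRC);
};
if ($@) {
  # Try something else as this was clearly not utf8.
} else {
  $file = $decoded_file;
}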

But then again, is this reasonable? On both Windows and Mac OS X we know exactly
what we get, as the filesystem will tell us. Even FAT has an encoding attribute
telling us what encoding the filesystem is in. And given that the OS actually
refuses to write file names that are not valid, it would be a safe bet that a
Path can be decoded with that encoding.

So the problem of knowing the encoding really only exists on Unix/Linux. This is
mainly because POSIX does not care about encoding and most filesystems seem
to follow suit. But who knows if future filesystems will still be so lax with
input? The current trend of putting more database features in the filesystem
might also bring some more input validation, and in the future we might not
have to deal with the insanity of multiple encodings.

Apparently JFS today has the option of limiting file name encoding.

http://lwn.net/Articles/71472/

Even without a filesystem restriction, on Linux/Unix we have a default encoding
specified in the locale that most software will respect, so when I name a file
"ÆØÅ" on my Ubuntu box all my programs will show it as such and not give me a
garbled string. So even if we have no guarantee that file names are encoded in
what the locale is set to, it's the best information we have.
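
As a rough illustration of "the best information we have", here is how one
existing runtime exposes that default. This is Python rather than Perl, used
only as an example; on Unix the value is typically derived from the locale,
and nothing here is a proposal for the Perl 6 API.

```python
import codecs
import sys

# The encoding this Python process assumes for file names.
# On Unix it is normally taken from the locale (for example "utf-8").
fs_encoding = sys.getfilesystemencoding()

# Look the name up to confirm it refers to a codec we can actually use.
codec = codecs.lookup(fs_encoding)
print(f"file names are assumed to be encoded as {codec.name}")
```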

One could always argue that even if the filesystem restricts file name input,
one still has the option of ignoring this, as a string of bytes encoded one way
will often be valid under the rules of another encoding, just with another
meaning. But such a file name will be wrong in all other programs, so why
should it be correct or unspecified (as in just a stream of bytes) in Perl 6?
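
To make the "valid under another encoding, just with another meaning" point
concrete, here is a small sketch in Python (chosen only for illustration),
using the bytes of the "ÆØÅ" file name mentioned earlier:

```python
# The UTF-8 bytes of "ÆØÅ" are also valid Latin-1, but mean something else.
utf8_bytes = "ÆØÅ".encode("utf-8")        # b'\xc3\x86\xc3\x98\xc3\x85'
as_latin1 = utf8_bytes.decode("latin-1")  # six characters, none of them Æ, Ø or Å

# The reverse direction fails: the Latin-1 bytes are not well-formed UTF-8.
latin1_bytes = "ÆØÅ".encode("latin-1")    # b'\xc6\xd8\xc5'
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
```

So whether a wrong guess is detected depends on the direction: a strict UTF-8
decode notices Latin-1 input, while a Latin-1 decode accepts anything.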

My idea of working with file names would be that we default to the locale or
filesystem settings, but give the option of working with paths/file names as
binary or in a specific encoding.

my $file = readdir $dir; # Default to locale settings, e.g. utf8

This will return a Path decoded as UTF-8; if the decoding fails, no decoding
will be done and we return a binary Path.

my $file = readdir $dir, :utf8; # Decodes as utf8

my $file = readdir $dir, :bin; # No decoding is done
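
The three variants above could behave roughly like the following sketch. It is
written in Python purely for illustration; decode_name, read_names and the
default_encoding parameter are hypothetical names, not part of any spec. It
also assumes that an explicitly requested encoding is strict and raises on
failure, which the proposal above does not settle.

```python
import os
import sys
from typing import List, Optional, Union


def decode_name(raw: bytes,
                encoding: Optional[str] = None,
                binary: bool = False,
                default_encoding: Optional[str] = None) -> Union[str, bytes]:
    """Decode one raw file name according to the proposed rules."""
    if binary:
        # Like ":bin": no decoding is done at all.
        return raw
    if encoding is not None:
        # Like ":utf8": decode strictly (assumption: failure raises).
        return raw.decode(encoding)
    # Default: try the locale/filesystem encoding, fall back to raw bytes.
    fallback = default_encoding or sys.getfilesystemencoding()
    try:
        return raw.decode(fallback)
    except UnicodeDecodeError:
        return raw


def read_names(directory: str, **options) -> List[Union[str, bytes]]:
    """List a directory as raw bytes, then decode each name."""
    raw_names = os.listdir(os.fsencode(directory))  # bytes in, bytes out
    return [decode_name(raw, **options) for raw in raw_names]
```

With this sketch, read_names(".") gives strings wherever the default encoding
fits and bytes elsewhere, read_names(".", encoding="utf-8") insists on UTF-8,
and read_names(".", binary=True) returns only bytes.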

The whole reason for this is that paths and filenames should not be special;
they are just another form of user input, where we should have some sane
default so it does what we expect.

More reading on the topic:

Python 3 problems:
http://bugs.python.org/issue4006

Unicode handling in Linux:
http://hektor.umcs.lublin.pl/~mikosmul/computing/articles/linux-unicode.html

Regards Troels.

On Wed, Aug 19, 2009 at 03:17, Timothy S. Nelson<wayl...@wayland.id.au> wrote:
>        See this link.
>
> http://archive.netbsd.se/?ml=perl6-language&a=2008-11&t=9170058
>
>        In particular, I thought Tom Christiansen's long message had some
> relevant info about filename literals.
>
>        :)
>
>
> ---------------------------------------------------------------------
> | Name: Tim Nelson                 | Because the Creator is,        |
> | E-mail: wayl...@wayland.id.au    | I am                           |
> ---------------------------------------------------------------------
>
> ----BEGIN GEEK CODE BLOCK----
> Version 3.12
> GCS d+++ s+: a- C++$ U+++$ P+++$ L+++ E- W+ N+ w--- V- PE(+) Y+>++ PGP->+++
> R(+) !tv b++ DI++++ D G+ e++>++++ h! y-
> -----END GEEK CODE BLOCK-----
>
>
