btw. you stirred up a hornets' nest - using '%' as a valid filename character turns out to be a problem across all archive-like filesystem providers (tar, zip, ..). Also, FileObject.getName().getURI() didn't correctly encode the path, i.e. one can't use its result to resolve a file again. I have to investigate this in more detail.

Already a year ago, when I was fiddling with classloaders, I found
out that URL encoding (e.g. in URLClassLoader) is completely flawed
in Java. There are bug reports about this on java.sun.com, but the official
answer seems to be that the way URL encoding is done now
is too central to be changed, since big software has been written that
assumes URLs are encoded wrongly.
Therefore vfs should encode the file path into the URL itself. It's not hard and
it is the only way.
This brings up an interesting topic: the set of characters that
are allowed to appear in a file name. As we know, the set of prohibited characters
differs between operating systems.
Since vfs is a cross-platform file system, it could define its own set of
prohibited characters, maybe the union of the prohibited characters on win/unix/mac.
But that is impossible, since vfs will find files on unix whose names contain characters
that are prohibited elsewhere, say on windows.
Maybe a FileSystemProvider, when instantiated, has to be able to tell which
characters are allowed. Of course vfs can also stay completely neutral on the issue
and let the os / network protocol report an error when an illegal filename
is used. Nevertheless it would be excellent to document these kinds of issues
as part of the vfs project. Then it would also be easier to say for sure which
characters need to be encoded in a URL.
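
To make the "provider tells which characters are allowed" idea concrete, here is a rough sketch. NameRules and NameRulesSketch are names I made up for illustration, they are not part of vfs, and the two rule sets are simplified (real Windows rules also cover reserved names, trailing dots and so on).

```java
// Hypothetical interface, not a vfs type: a provider could expose
// something like this to say which characters it accepts in a name.
interface NameRules {
    boolean isAllowed(char c);
}

final class NameRulesSketch {
    private NameRulesSketch() {}

    // Simplified assumption: unix-like systems forbid only '/' and NUL.
    static final NameRules UNIX_LIKE = c -> c != '/' && c != '\0';

    // Simplified assumption: windows forbids control characters
    // and the characters < > : " / \ | ? *
    static final NameRules WINDOWS_LIKE =
            c -> c >= 0x20 && "<>:\"/\\|?*".indexOf(c) < 0;

    // Returns the index of the first character the rules reject, or -1.
    static int firstIllegal(String name, NameRules rules) {
        for (int i = 0; i < name.length(); i++) {
            if (!rules.isAllowed(name.charAt(i))) {
                return i;
            }
        }
        return -1;
    }
}
```

With these rules a name like a:b passes UNIX_LIKE but fails WINDOWS_LIKE, which is exactly why a single union of prohibited characters cannot describe the files vfs will actually meet.
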
Also I think decodeURI and encodeURI should be symmetrical.
Maybe we don't need to know anything about filenames;
we only need to know about URIs.
What is the set of characters that need to be encoded in a URI?
Well, let's see RFC 2396:


/reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","/

These are reserved characters because they have a special meaning in a URI:
they work as delimiters between the different components, and the scheme
finally decides whether they are delimiters or not (I think).
They should be escaped, but note:

/2.4.2. When to Escape and Unescape

A URI is always in an "escaped" form, since escaping or unescaping a
completed URI might change its semantics. Normally, the only time
escape encodings can safely be made is when the URI is being created
from its component parts; each component may have its own set of
characters that are reserved, so only the mechanism responsible for
generating or interpreting that component can determine whether or
not escaping a character will change its semantics. Likewise, a URI
must be separated into its components before the escaped characters
within those components can be safely decoded./

So when I have a path like /foo/%bar I should encode % but not /.
Looking at the reserved character set, in the case of the file: scheme I
think none of them should be escaped.
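
For comparison, the multi-argument java.net.URI constructors already work "from component parts" in the sense of section 2.4.2: they always quote '%' in the path and leave '/' alone. A minimal sketch, where FileUriSketch and toFileUri are just illustrative names and the path is assumed to be absolute:

```java
import java.net.URI;
import java.net.URISyntaxException;

final class FileUriSketch {
    private FileUriSketch() {}

    // Builds a file: URI from an unescaped absolute path.
    // The multi-argument URI constructor quotes illegal characters,
    // and '%' is always quoted by these constructors.
    static URI toFileUri(String absolutePath) {
        try {
            return new URI("file", null, absolutePath, null);
        } catch (URISyntaxException e) {
            // e.g. the path does not start with '/'
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI uri = toFileUri("/foo/%bar");
        System.out.println(uri);              // file:/foo/%25bar
        System.out.println(uri.getRawPath()); // /foo/%25bar
        System.out.println(uri.getPath());    // /foo/%bar
    }
}
```

getPath() decodes the escapes again, so the original path comes back unchanged.
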

/2.4.3. Excluded US-ASCII Characters/
/control = <US-ASCII coded characters 00-1F and 7F hexadecimal>/
/space = <US-ASCII coded character 20 hexadecimal>
delims = "<" | ">" | "#" | "%" | <">

The angle-bracket "<" and ">" and double-quote (") characters are
excluded because they are often used as the delimiters around URI in
text documents and protocol fields. The character "#" is excluded
because it is used to delimit a URI from a fragment identifier in URI
references (Section 4). The percent character "%" is excluded because
it is used for the encoding of escaped characters./

I think these should always be encoded in a URI.

There are also the "unwise" characters:

/Other characters are excluded because gateways and other transport
agents are known to sometimes modify such characters, or they are
used as delimiters.

unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
/
But I don't think these should be encoded.

So all in all, for the file URI scheme I think the characters to encode are:

*control = <US-ASCII coded characters 00-1F and 7F hexadecimal>*
*space = <US-ASCII coded character 20 hexadecimal>*
*delims = "<" | ">" | "#" | "%" | <">*
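
A rough sketch of what a symmetrical encode/decode pair for exactly this set could look like. PathEscaper is a name I made up, it is not vfs code, and it assumes that characters outside US-ASCII are simply passed through untouched:

```java
final class PathEscaper {
    private static final String DELIMS = "<>#%\"";
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    private PathEscaper() {}

    // control (00-1F, 7F), space (20) and delims are escaped;
    // everything else, including '/', is left as it is.
    static boolean needsEscape(char c) {
        return c <= 0x20 || c == 0x7F || DELIMS.indexOf(c) >= 0;
    }

    static String encode(String path) {
        StringBuilder out = new StringBuilder(path.length() + 8);
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (needsEscape(c)) {
                // every escaped character is below 0x80, so two hex digits suffice
                out.append('%').append(HEX[c >> 4]).append(HEX[c & 0x0F]);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    static String decode(String encoded) {
        StringBuilder out = new StringBuilder(encoded.length());
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c != '%') {
                out.append(c);
                continue;
            }
            if (i + 2 >= encoded.length()) {
                throw new IllegalArgumentException("truncated escape at index " + i);
            }
            int hi = Character.digit(encoded.charAt(i + 1), 16);
            int lo = Character.digit(encoded.charAt(i + 2), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("bad escape at index " + i);
            }
            out.append((char) ((hi << 4) | lo));
            i += 2; // skip the two hex digits just consumed
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("/foo/%bar")); // /foo/%25bar
        System.out.println(encode("/<>#%\""));   // /%3C%3E%23%25%22
        System.out.println(decode(encode("/a b\n/c")).equals("/a b\n/c")); // true
    }
}
```

Because encode always turns a literal '%' into %25, every '%' left in the output starts an escape, so decode(encode(p)) gives back p.
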

On my Linux box I can create a directory named
<>#%"
I just need to write mkdir \<\>#%\"
It has also happened to me that a program created
a file whose name contained newlines and some other non-printable
characters.

Copying this folder to some other os would (probably) result
in an exception.


If I could I would assign you 12 points (the maximum) for catching this problem ;-)

Why can't you ?-)

- rami
