Re: File API: File's name property

2013-09-06 Thread Glenn Maynard
On Fri, Sep 6, 2013 at 10:42 AM, Anne van Kesteren wrote:

> If the raw input to the URL parser includes a backslash, it'll be
> treated as a forward slash. I am not really expecting people to use
> encodeURI or such utilities.
>

People who don't will have a bug, but all this is doing is preemptively
adding the bug, not preventing it, and forcing it on unrelated features
(HTMLInputElement.files).  Don't the ZIP URL proposals require some
characters or other to be escaped anyway (at least of the ones that support
navigation)?

It's far too late to try to keep people from having to escape things in
URLs.

> > Having a separate field is fine.  This is specific to ZIPs, so it feels
> > like it belongs in a ZipFile subclass, not File itself.
>
> Is it? There's no other file systems where the file names are
> effectively byte sequences? If that's the case, maybe that's fine.
>

There are lots of them.  I meant that it seems like wanting to expose raw
bytes is specific to ZIPs.  I hope we wouldn't expose the user's local
filesystem locale to the Web.  Depending on the user's locale causes some
of the more obnoxious bugs the platform has; we should be fighting to kill
it, not add more of it.


> > We definitely wouldn't want raw bytes from filenames being filled in
> > from user filesystems (e.g. Shift-JIS filenames in Linux),
>
> The question is whether you can have something random without
> associated encoding. If there's an encoding it's easy to put lipstick
> on a pig.
>

You can have filenames in Linux that are in a different encoding than
expected.  I don't know why you'd want to expose that to the web, though.


> >> There's an API too.
> >
> > It might be better to wait until we have a filesystem API, then
> > piggyback on that...
>
> Yeah, I wondered about that. It depends on whether we want to expose
> directories or just treat a zip archive as an ordered map of
> path/resource pairs.
>

I've found being able to work with a directory or a ZIP in the same way to
be useful in the past, too.


On Fri, Sep 6, 2013 at 12:08 PM, Anne van Kesteren  wrote:

> Actually, given that zip paths are byte sequences, that would not work
> anyway. The alternative might be to always map it to code points
> somehow via requiring an encoding to be specified and just deal with
> the losses, but that doesn't seem general purpose enough.
>

Taking an arbitrary use case: showing the user a list of files inside a
ZIP, and letting him pick one to be extracted.  Exposing raw filenames is
one way to make this work: you iterate over Files in the ZIP, pull out the
File.name for display to the user and stash the File.rawName so you can
look up the File later.  Once the user picks a file from the list, you call
zip.getFileByRawName(stashedRawName) with the associated rawName to
retrieve the selected file.

But, that doesn't "just work".  I assume the API will have a
"getFileByName(DOMString filename)"-like method as well as a rawName
method, and people will be much more likely to ignore byRawName and only
use byName.  The developer has to be careful to store the rawName and only
look up files using raw names if he wants broken filenames to work.

An alternative solution: as you iterate over Files to create a list to
display to the user, stash the File as well (instead of the rawName),
associated with each list entry.  When the user selects a file, you just
use the File you already have, and never pass the filename back to the
API.  This would also take special effort by developers, but no more than
the rawName solution, and it avoids exposing raw filenames entirely.
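
A minimal sketch of the "stash the entry, not the name" approach, written
in Python with the standard zipfile module purely for illustration (the
archive contents and the extract() helper are made up for the example; no
browser ZIP API is implied):

```python
import io
import zipfile

# Build a small in-memory archive so the sketch is self-contained.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("icons/save.png", b"save-bytes")
    zf.writestr("notes.txt", b"hello")

archive = zipfile.ZipFile(buf)

# Pair each display label with the entry object itself, not with its name.
choices = [(info.filename, info) for info in archive.infolist()]

def extract(choice_index):
    """Read the entry the user picked, using the stashed entry object."""
    _label, info = choices[choice_index]
    return archive.read(info)  # read() accepts a ZipInfo as well as a name
```

The display label is only ever shown to the user; the lookup goes through
the stashed object, so a name that decoded badly never has to be passed
back to the archive.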

For ZIP URLs, it seems like linking inside a legacy ZIP (rather than a ZIP
of icons or whatever that you just created to link to) would be uncommon.
(Also, if you think people won't escape backslashes, they definitely won't
escape garbage filenames with a special byte-escape mechanism...)  Are
there likely use cases here?


On Fri, Sep 6, 2013 at 1:04 PM, Arun Ranganathan  wrote:

> I think it may be ok to restrict "/" and "\".  I don't think we lose too
> much here by not allowing historically "directory delimiting" characters in
> file names.
>

"\" is a valid character in real filenames.  This would break selecting
filenames with backslashes in them with HTMLInputElement, which works fine
today.

-- 
Glenn Maynard


Re: File API: File's name property

2013-09-06 Thread Anne van Kesteren
On Fri, Sep 6, 2013 at 4:42 PM, Anne van Kesteren  wrote:
> On Wed, Sep 4, 2013 at 11:45 PM, Glenn Maynard  wrote:
>> It might be better to wait until we have a filesystem API, then piggyback on
>> that...
>
> Yeah, I wondered about that. It depends on whether we want to expose
> directories or just treat a zip archive as an ordered map of
> path/resource pairs.

Actually, given that zip paths are byte sequences, that would not work
anyway. The alternative might be to always map it to code points
somehow via requiring an encoding to be specified and just deal with
the losses, but that doesn't seem general purpose enough.


-- 
http://annevankesteren.nl/



Re: File API: File's name property

2013-09-06 Thread Arun Ranganathan

On Sep 6, 2013, at 11:42 AM, Anne van Kesteren wrote:

> On Wed, Sep 4, 2013 at 11:45 PM, Glenn Maynard  wrote:
>> On Tue, Sep 3, 2013 at 12:04 PM, Anne van Kesteren  wrote:
>>> The problem is that once you put it through the URL parser it'll
>>> become "/". And I suspect given directory APIs and such it'll go
>>> through that layer at some point.
>> 
>> I don't follow.  Backslashes in filenames are escaped in URLs
>> (http://zewt.org/~glenn/test%5Cfile), like all the other things that require
>> escaping.
> 
> If the raw input to the URL parser includes a backslash, it'll be
> treated as a forward slash. I am not really expecting people to use
> encodeURI or such utilities.


I think it may be ok to restrict "/" and "\".  I don't think we lose too much 
here by not allowing historically "directory delimiting" characters in file 
names.

The question is what to do with a "/"  or a "\".   I'm inclined to say UAs 
should treat those as U+FFFD.
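
As a rough sketch of what that substitution would mean in practice (Python
used only for illustration; restrict_separators is a hypothetical helper,
not anything from the File API):

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

def restrict_separators(name: str) -> str:
    """Replace "/" and "\\" in a file name with U+FFFD (hypothetical helper)."""
    return name.replace("/", REPLACEMENT).replace("\\", REPLACEMENT)

# Two names that differ only in their separator collapse to the same string.
collapsed = restrict_separators("a/b") == restrict_separators("a\\b")
```

Note that the mapping is lossy: once both characters become U+FFFD, the
original name cannot be recovered from the result.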

> 
>>> Well, my suggestion was rawName and name (which would have loss of
>>> information), per the current zip archive API design.
>> 
>> Having a separate field is fine.  This is specific to ZIPs, so it feels like
>> it belongs in a ZipFile subclass, not File itself.
> 
> Is it? There's no other file systems where the file names are
> effectively byte sequences? If that's the case, maybe that's fine.


Well…. 

Some file systems don't store names as unrestricted byte sequences (older 
Windows), but GNU systems usually do.  Some byte sequences are not valid names. 
Conversely, names of existing files may not be representable as byte sequences 
(and sometimes there are two representations -- e.g. Amélie.txt will either use
U+00E9 or U+0065 U+0301 for the é -- both are Unicode equivalents, but are different
byte sequences). Some file systems perform Unicode canonicalization on file 
names, which is more or less what I think the Web should do.
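
To illustrate the two-representations point, here is a small sketch in
Python (chosen only for illustration) comparing a precomposed and a
decomposed spelling of the same name, before and after NFC normalization.
The sample name is an assumption made for the example:

```python
import unicodedata

composed = "Am\u00e9lie.txt"     # accented letter as one code point, U+00E9
decomposed = "Ame\u0301lie.txt"  # "e" followed by U+0301 COMBINING ACUTE ACCENT

# The UTF-8 byte sequences differ even though the names look identical.
same_bytes = composed.encode("utf-8") == decomposed.encode("utf-8")

# After canonical composition (NFC) the two strings compare equal.
same_after_nfc = (unicodedata.normalize("NFC", composed)
                  == unicodedata.normalize("NFC", decomposed))
```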

I think we run only a small risk of information loss, but I DO think that File 
name should be an [EnforceUTF16] DOMString.  That way, we have the best shot at 
byte sequences based on the underlying characterization.

Summary: I'll punt on File.rawName till a rainier day than today, but I will 
restrict "/" and "\" since they are historically directory separators.  I know 
that there are OTHER characters that we can also restrict, but these two are 
the big ones and get us some 80-20 sanitization :)

Glenn said:

>> It might be better to wait until we have a filesystem API, then piggyback on
>> that...

+1.

-- A*

Re: File API: File's name property

2013-09-06 Thread Anne van Kesteren
On Wed, Sep 4, 2013 at 11:45 PM, Glenn Maynard  wrote:
> On Tue, Sep 3, 2013 at 12:04 PM, Anne van Kesteren  wrote:
>> The problem is that once you put it through the URL parser it'll
>> become "/". And I suspect given directory APIs and such it'll go
>> through that layer at some point.
>
> I don't follow.  Backslashes in filenames are escaped in URLs
> (http://zewt.org/~glenn/test%5Cfile), like all the other things that require
> escaping.

If the raw input to the URL parser includes a backslash, it'll be
treated as a forward slash. I am not really expecting people to use
encodeURI or such utilities.


>> Well, my suggestion was rawName and name (which would have loss of
>> information), per the current zip archive API design.
>
> Having a separate field is fine.  This is specific to ZIPs, so it feels like
> it belongs in a ZipFile subclass, not File itself.

Is it? There's no other file systems where the file names are
effectively byte sequences? If that's the case, maybe that's fine.


> We definitely wouldn't
> want raw bytes from filenames being filled in from user filesystems (eg.
> Shift-JIS filenames in Linux),

The question is whether you can have something random without
associated encoding. If there's an encoding it's easy to put lipstick
on a pig.


> and Windows filenames aren't even bytes
> (they're natively UTF-16).

Right, that would end up as a utf-8 byte sequence in File.rawName and
File.name would do the right thing with that.


>> There's an API too.
>
> It might be better to wait until we have a filesystem API, then piggyback on
> that...

Yeah, I wondered about that. It depends on whether we want to expose
directories or just treat a zip archive as an ordered map of
path/resource pairs.


-- 
http://annevankesteren.nl/



Re: File API: File's name property

2013-09-04 Thread Glenn Maynard
On Tue, Sep 3, 2013 at 12:04 PM, Anne van Kesteren wrote:

> The problem is that once you put it through the URL parser it'll
> become "/". And I suspect given directory APIs and such it'll go
> through that layer at some point.
>

I don't follow.  Backslashes in filenames are escaped in URLs
(http://zewt.org/~glenn/test%5Cfile), like all the other things that require
escaping.
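
For reference, a short sketch of that escaping, using Python's
urllib.parse only as a stand-in for whatever escaping utility a page would
actually use (the file name is taken from the URL above):

```python
from urllib.parse import quote, unquote

name = "test\\file"             # a file name containing one backslash
escaped = quote(name, safe="")  # percent-encode everything except unreserved characters
restored = unquote(escaped)     # decoding gives the original name back
```

Once escaped, the backslash travels as %5C and is never seen as a raw
character by whatever parses the URL.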

> Well, my suggestion was rawName and name (which would have loss of
> information), per the current zip archive API design.
>

Having a separate field is fine.  This is specific to ZIPs, so it feels
like it belongs in a ZipFile subclass, not File itself.  We definitely
wouldn't want raw bytes from filenames being filled in from user
filesystems (eg. Shift-JIS filenames in Linux), and Windows filenames
aren't even bytes (they're natively UTF-16).


> > By the way, in the current ZIP URL proposal, where would a File be
> > created?  If you use XHR to access a file inside a ZIP URL then you'd
> > just get a Blob, right?
>
> There's an API too.
>

It might be better to wait until we have a filesystem API, then piggyback
on that...

-- 
Glenn Maynard


Re: File API: File's name property

2013-09-03 Thread Arun Ranganathan

On Sep 3, 2013, at 12:28 PM, Anne van Kesteren wrote:

> On Tue, Sep 3, 2013 at 5:14 PM, Glenn Maynard  wrote:
>> On Tue, Sep 3, 2013 at 10:17 AM, Anne van Kesteren  wrote:
>>> I don't think you want those conversion semantics for name. I do think
>>> we want the value space for names across different systems to be
>>> equivalent, which if we support zip basically means bytes.
>> 
>> I don't really understand the suggestion of using a ByteString for
>> File.name.  Can you explain how that wouldn't break
>> https://zewt.org/~glenn/picker.html, if the user picks a file named
>> "漢字.txt"?
> 
> ByteString doesn't work. A byte sequence might. If the platform does
> file names in Unicode it would be converted to bytes using utf-8.


Which in fact is how I think we should do File.name.  We'll stick to DOMString, 
but think it should specify a conversion to a byte sequence using utf-8.  And, 
restrict separators such as "/" and "\".

-- A*


Re: File API: File's name property

2013-09-03 Thread Anne van Kesteren
On Tue, Sep 3, 2013 at 5:31 PM, Arun Ranganathan  wrote:
> Which in fact is how I think we should do File.name.  We'll stick to 
> DOMString, but think it should specify a conversion to a byte sequence using 
> utf-8.  And, restrict separators such as "/" and "\".

That doesn't solve the problem I mentioned earlier for arbitrary file
names coming out of zip archives. And then your data model is not
bytes, but Unicode scalar values. We could of course accept
information loss of some kind in the conversion process between zip
archive resources and File objects and require developers to keep
track of that if they care.


-- 
http://annevankesteren.nl/



Re: File API: File's name property

2013-09-03 Thread Glenn Maynard
On Tue, Sep 3, 2013 at 11:31 AM, Arun Ranganathan  wrote:

> And, restrict separators such as "/" and "\".
>

I thought we just agreed that "\" is a platform-specific thing that
File.name shouldn't restrict.  "/" is a directory separator on just about
every platform, but "\" can appear in filenames on many systems.

On Tue, Sep 3, 2013 at 11:28 AM, Anne van Kesteren wrote:

> ByteString doesn't work. A byte sequence might. If the platform does
> file names in Unicode it would be converted to bytes using utf-8.
>

I don't know what API is being suggested that would keep File.name acting
like a String, but also allow containing arbitrary bytes.  I could imagine
one (an object that holds bytes, stringifies assuming UTF-8 and converts
from strings assuming UTF-8), but that's pretty ugly...

On Tue, Sep 3, 2013 at 11:42 AM, Anne van Kesteren wrote:

> That doesn't solve the problem I mentioned earlier for arbitrary file
> names coming out of zip archives. And then your data model is not
> bytes, but Unicode scalar values. We could of course accept
> information loss of some kind in the conversion process between zip
> archive resources and File objects and require developers to keep
> track of that if they care.
>

If you want to retain the original bytes of the filename somewhere, it
seems like it should go somewhere other than File.name.  For example, a
subclass of File, ZipFile, could contain a ByteString filenameBytes with
the original filename.  I wonder when you'd need that info, though.

By the way, in the current ZIP URL proposal, where would a File be
created?  If you use XHR to access a file inside a ZIP URL then you'd just
get a Blob, right?

-- 
Glenn Maynard


Re: File API: File's name property

2013-09-03 Thread Anne van Kesteren
On Tue, Sep 3, 2013 at 5:54 PM, Glenn Maynard  wrote:
> On Tue, Sep 3, 2013 at 11:31 AM, Arun Ranganathan  wrote:
>> And, restrict separators such as "/" and "\".
>
> I thought we just agreed that "\" is a platform-specific thing that
> File.name shouldn't restrict.  "/" is a directory separator on just about
> every platform, but "\" can appear in filenames on many systems.

The problem is that once you put it through the URL parser it'll
become "/". And I suspect given directory APIs and such it'll go
through that layer at some point.


> On Tue, Sep 3, 2013 at 11:28 AM, Anne van Kesteren  wrote:
>>
>> ByteString doesn't work. A byte sequence might. If the platform does
>> file names in Unicode it would be converted to bytes using utf-8.
>
> I don't know what API is being suggested that would keep File.name acting
> like a String, but also allow containing arbitrary bytes.  I could imagine
> one (an object that holds bytes, stringifies assuming UTF-8 and converts
> from strings assuming UTF-8), but that's pretty ugly...

Well, my suggestion was rawName and name (which would have loss of
information), per the current zip archive API design.


> By the way, in the current ZIP URL proposal, where would a File be created?
> If you use XHR to access a file inside a ZIP URL then you'd just get a Blob,
> right?

There's an API too.


-- 
http://annevankesteren.nl/



Re: File API: File's name property

2013-09-03 Thread Anne van Kesteren
On Tue, Sep 3, 2013 at 5:14 PM, Glenn Maynard  wrote:
> On Tue, Sep 3, 2013 at 10:17 AM, Anne van Kesteren  wrote:
>> I don't think you want those conversion semantics for name. I do think
>> we want the value space for names across different systems to be
>> equivalent, which if we support zip basically means bytes.
>
> I don't really understand the suggestion of using a ByteString for
> File.name.  Can you explain how that wouldn't break
> https://zewt.org/~glenn/picker.html, if the user picks a file named
> "漢字.txt"?

ByteString doesn't work. A byte sequence might. If the platform does
file names in Unicode it would be converted to bytes using utf-8.


-- 
http://annevankesteren.nl/



Re: File API: File's name property

2013-09-03 Thread Glenn Maynard
On Tue, Sep 3, 2013 at 9:03 AM, Arun Ranganathan  wrote:

> It wouldn't be wise to restrict '/' or '\' or try to delve too deep into
> platform land BUT the FileSystem API introduces directory syntax which
> might make being lax a fly in the ointment for later.
>

I wouldn't object to restricting "/" if it'll make other APIs more
sensible.  Every platform I've used treats it as a separator.

On Tue, Sep 3, 2013 at 10:17 AM, Anne van Kesteren wrote:

> I don't think you want those conversion semantics for name. I do think
> we want the value space for names across different systems to be
> equivalent, which if we support zip basically means bytes.


I don't really understand the suggestion of using a ByteString for
File.name.  Can you explain how that wouldn't break
https://zewt.org/~glenn/picker.html, if the user picks a file named
"漢字.txt"?

-- 
Glenn Maynard


Re: File API: File's name property

2013-09-03 Thread Anne van Kesteren
On Tue, Sep 3, 2013 at 3:03 PM, Arun Ranganathan  wrote:
> Well, https://www.w3.org/Bugs/Public/show_bug.cgi?id=23138 is to make the 
> 'type' attribute a ByteString.  Is that your request here for the name 
> attribute as well?

I don't think you want those conversion semantics for name. I do think
we want the value space for names across different systems to be
equivalent, which if we support zip basically means bytes. This could
mean accepting DOMString and then doing the conversion yourself
through utf-8. However, it's not very clear to me how to do the
conversion back in a way that minimizes information loss and works
everywhere compatibly. For zip archives I ended up with rawPath
(bytes) and path (bytes converted to a string using utf-8 and vice
versa). Maybe we should use that model here too?
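
A sketch of that two-field model, in Python for illustration only. The
variable names mirror rawPath and path, and the Shift-JIS input is an
assumed example of a legacy archive entry:

```python
# Bytes as a legacy archive might store them (not valid UTF-8).
raw_path = "漢字.txt".encode("shift_jis")

# Lossy view: decode as UTF-8, substituting U+FFFD for invalid sequences.
path = raw_path.decode("utf-8", errors="replace")

# The reverse conversion does not recover the original bytes.
round_trips = path.encode("utf-8") == raw_path

# A name that was stored as UTF-8 to begin with survives the round trip.
utf8_raw = "漢字.txt".encode("utf-8")
utf8_round_trips = utf8_raw.decode("utf-8").encode("utf-8") == utf8_raw
```

This is why the byte form has to be kept alongside the string form if the
original entry is to be found again by name.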


> It wouldn't be wise to restrict '/' or '\' or try to delve too deep into 
> platform land BUT the FileSystem API introduces directory syntax which might 
> make being lax a fly in the ointment for later.

Right. Zip archives also have paths and it would be annoying if we ran
into problems there.


-- 
http://annevankesteren.nl/



Re: File API: File's name property

2013-09-03 Thread Arun Ranganathan
Well, https://www.w3.org/Bugs/Public/show_bug.cgi?id=23138 is to make the 
'type' attribute a ByteString.  Is that your request here for the name 
attribute as well?

It wouldn't be wise to restrict '/' or '\' or try to delve too deep into 
platform land BUT the FileSystem API introduces directory syntax which might 
make being lax a fly in the ointment for later.


On Aug 29, 2013, at 10:48 AM, Anne van Kesteren wrote:

> As currently specified File's name property seems to be a code unit
> sequence. In zip archives the resource's path is a byte sequence. I
> don't really know what popular file systems do. Given that a File has
> to be transmitted over the wire now and then, including its name
> property value, a code unit sequence seems like the wrong type. It
> would at least lead to information loss which I'm not sure is
> acceptable if we can prevent it (or at least make it more obvious that
> it is going on, by doing a transformation early on).
> 
> We may also want to restrict "\" and "/" to leave room for using these
> objects in path-based contexts later.
> 
> 
> -- 
> http://annevankesteren.nl/
> 




Re: File API: File's name property

2013-08-29 Thread Glenn Maynard
On Thu, Aug 29, 2013 at 10:51 AM, Anne van Kesteren wrote:

> On Thu, Aug 29, 2013 at 4:46 PM, Glenn Maynard  wrote:
> > All constructing a File does is give a name (and date) to a Blob.  It
> > doesn't create an association to an on-disk file, and shouldn't be
> > restricted to filenames the local platform's filesystem can represent.
>
> Yes, but it can be submitted to a server so it has to be transformed
> at some point. It seems way better to do the transformation early so
> what you see in client-side JavaScript is similar to what you'd see in
> Node.js.
>

It's transformed from a UTF-16 DOMString to the encoding of the protocol
it's being transferred over, just like any other DOMString being sent over
a non-UTF-16 protocol.
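
A small sketch of that transformation, in Python for illustration. The
header form shown is the RFC 6266 / RFC 5987 "filename*" syntax, picked
here only as one possible wire format; form submissions use other
encodings of the name:

```python
from urllib.parse import quote

name = "漢字.txt"

utf16_units = name.encode("utf-16-le")  # in-memory form: 2 bytes per code unit here
utf8_bytes = name.encode("utf-8")       # what a UTF-8 based protocol would carry

# Percent-encoded UTF-8 inside an extended header parameter.
header = ("Content-Disposition: attachment; filename*=UTF-8''"
          + quote(name, safe=""))
```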

> > URL parsing does lots of weird things that shouldn't be spread to the rest
> > of the platform.  File.name and URL parsing are completely different
> > things, and filenames on non-Windows systems can contain backslashes.
>
> All the more reason to do something with it to prevent down-level bugs.
>

We shouldn't prevent people in Linux from seeing their filenames because
those filenames wouldn't be valid on Windows.  That would require much more
than just backslashes--you'd need to prevent all characters and strings
that aren't valid in Windows, such as "COM0".

Even having non-ASCII filenames will cause problems for Windows users,
since many Windows applications can only access filenames which are a
subset of the user's locale (it takes extra work to use Unicode filenames
in Windows).

-- 
Glenn Maynard


Re: File API: File's name property

2013-08-29 Thread Anne van Kesteren
On Thu, Aug 29, 2013 at 4:46 PM, Glenn Maynard  wrote:
> All constructing a File does is give a name (and date) to a Blob.  It
> doesn't create an association to an on-disk file, and shouldn't be
> restricted to filenames the local platform's filesystem can represent.

Yes, but it can be submitted to a server so it has to be transformed
at some point. It seems way better to do the transformation early so
what you see in client-side JavaScript is similar to what you'd see in
Node.js.


>> Given that the URL parser treats them identically, we should treat
>> them identically everywhere else too.
>
> URL parsing does lots of weird things that shouldn't be spread to the rest
> of the platform.  File.name and URL parsing are completely different things,
> and filenames on non-Windows systems can contain backslashes.

All the more reason to do something with it to prevent down-level bugs.


-- 
http://annevankesteren.nl/



Re: File API: File's name property

2013-08-29 Thread Glenn Maynard
On Thu, Aug 29, 2013 at 10:14 AM, Anne van Kesteren wrote:

> On Thu, Aug 29, 2013 at 4:10 PM, Glenn Maynard  wrote:
> > I don't think it makes sense to expect filenames to round-trip through
> > File.name, especially for filenames with a broken or unknown encoding.
> > File.name should be a best-effort at converting the platform filename to
> > something that can be displayed to users or encoded and put in a
> > Content-Disposition header, not an identifier for finding the file later.
>
> File has a constructor. We should be clearer about platforms too I suppose.
>

All constructing a File does is give a name (and date) to a Blob.  It
doesn't create an association to an on-disk file, and shouldn't be
restricted to filenames the local platform's filesystem can represent.

> Given that the URL parser treats them identically, we should treat
> them identically everywhere else too.
>

URL parsing does lots of weird things that shouldn't be spread to the rest
of the platform.  File.name and URL parsing are completely different
things, and filenames on non-Windows systems can contain backslashes.

-- 
Glenn Maynard


Re: File API: File's name property

2013-08-29 Thread Anne van Kesteren
On Thu, Aug 29, 2013 at 4:10 PM, Glenn Maynard  wrote:
> I don't think it makes sense to expect filenames to round-trip through
> File.name, especially for filenames with a broken or unknown encoding.
> File.name should be a best-effort at converting the platform filename to
> something that can be displayed to users or encoded and put in a
> Content-Disposition header, not an identifier for finding the file later.

File has a constructor. We should be clearer about platforms too I suppose.


>> We may also want to restrict "\" and "/" to leave room for using these
>> objects in path-based contexts later.
>
> Forward slash, but not backslash.  That's a platform-specific restriction.
> If we go down the route of limiting filenames which don't work on one or
> another system, the list of restrictions becomes very long.  If path
> separators are exposed on the web, they should always be forward-slashes.

Given that the URL parser treats them identically, we should treat
them identically everywhere else too.


-- 
http://annevankesteren.nl/



Re: File API: File's name property

2013-08-29 Thread Glenn Maynard
On Thu, Aug 29, 2013 at 9:48 AM, Anne van Kesteren  wrote:

> As currently specified File's name property seems to be a code unit
> sequence. In zip archives the resource's path is a byte sequence. I
> don't really know what popular file systems do. Given that a File has
> to be transmitted over the wire now and then, including its name
> property value, a code unit sequence seems like the wrong type. It
> would at least lead to information loss which I'm not sure is
> acceptable if we can prevent it (or at least make it more obvious that
> it is going on, by doing a transformation early on).
>

I don't think it makes sense to expect filenames to round-trip through
File.name, especially for filenames with a broken or unknown encoding.
File.name should be a best-effort at converting the platform filename to
something that can be displayed to users or encoded and put in a
Content-Disposition header, not an identifier for finding the file later.

> We may also want to restrict "\" and "/" to leave room for using these
> objects in path-based contexts later.
>

Forward slash, but not backslash.  That's a platform-specific restriction.
If we go down the route of limiting filenames which don't work on one or
another system, the list of restrictions becomes very long.  If path
separators are exposed on the web, they should always be forward-slashes.

-- 
Glenn Maynard


File API: File's name property

2013-08-29 Thread Anne van Kesteren
As currently specified File's name property seems to be a code unit
sequence. In zip archives the resource's path is a byte sequence. I
don't really know what popular file systems do. Given that a File has
to be transmitted over the wire now and then, including its name
property value, a code unit sequence seems like the wrong type. It
would at least lead to information loss which I'm not sure is
acceptable if we can prevent it (or at least make it more obvious that
it is going on, by doing a transformation early on).
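
One way a code unit sequence can lose information on the wire is an
unpaired surrogate, which has no UTF-8 encoding. The sketch below (Python,
for illustration only; to_wire is a hypothetical helper, and the
substitution with U+FFFD is an assumption about how the conversion would be
done) shows the effect:

```python
lone = "bad\ud800name.txt"  # contains an unpaired surrogate

def to_wire(name: str) -> bytes:
    """Encode as UTF-8, replacing surrogate code points with U+FFFD first.

    A Python str holds code points, so any surrogate in it is unpaired.
    """
    cleaned = "".join("\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch
                      for ch in name)
    return cleaned.encode("utf-8")

wire = to_wire(lone)
back = wire.decode("utf-8")  # what the receiving side would see
```

The receiver gets a name that differs from the one the script saw, which
is the kind of silent change that an early, explicit transformation would
make visible.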

We may also want to restrict "\" and "/" to leave room for using these
objects in path-based contexts later.


-- 
http://annevankesteren.nl/