On Sep 24, 2007, at 10:45 PM, Robert O'Callahan wrote:

> On 9/23/07, Maciej Stachowiak <[EMAIL PROTECTED]> wrote:
>> Obviously, if the way to get the contents as text requires providing
>> the encoding, then it has to be a method. My comment was about the
>> no-argument methods. But you have a point that reading from disk is not a
>> simple get operation. Probably the methods should have names based on
>> read or the like (read(), readAsText(), etc) to indicate this. Also,
>> they should arguably be asynchronous since reading from the disk can
>> be slow, especially for large files, and it is undesirable to block
>> the main thread.
>
> For small files, synchronous reading is OK. Perhaps there should be a separate whiz-bang asynchronous API ... it could support partial reads too.

What kind of file is small enough is a matter of judgment and depends on device performance characteristics. I tried the following experiment to estimate how much time synchronous cold reads of a moderate number of files could take (assuming multi-file support in <input type="file"> and naive use of the synchronous read API):

$ time cat ~/Pictures/*.jpg > /dev/null

real    0m1.135s
user    0m0.007s
sys     0m0.076s

This is on a pretty fast machine with a local filesystem. I have 76 .jpg files totaling about 19M in size. 1.13 seconds seems like an unacceptable length of time to block the UI, and it could easily be much worse for, say, a batch photo upload or an upload of a moderately large video file.

So I suspect that, much like synchronous XMLHttpRequest, synchronous file reads will lead to excessive UI lockups in bad circumstances unanticipated by the app author.
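
To make the asynchronous, partial-read idea concrete, here is a rough sketch in Python rather than in any proposed web API. The names read_chunks and read_all and the 64 KB chunk size are made up for illustration; the only point is that each blocking disk read runs off the main thread and control returns to the event loop between chunks.

```python
import asyncio
from typing import AsyncIterator

CHUNK_SIZE = 64 * 1024  # arbitrary chunk size, chosen only for illustration


async def read_chunks(path: str, chunk_size: int = CHUNK_SIZE) -> AsyncIterator[bytes]:
    """Yield a file's bytes chunk by chunk without blocking the event loop.

    Each blocking call (open, read, close) runs in a worker thread via
    asyncio.to_thread (Python 3.9+), so other tasks can run in between.
    """
    f = await asyncio.to_thread(open, path, "rb")
    try:
        while True:
            chunk = await asyncio.to_thread(f.read, chunk_size)
            if not chunk:  # empty bytes means end of file
                break
            yield chunk
    finally:
        await asyncio.to_thread(f.close)


async def read_all(path: str) -> bytes:
    """Convenience wrapper: collect every chunk into one bytes object."""
    parts = [chunk async for chunk in read_chunks(path)]
    return b"".join(parts)
```

A caller that only needs the start of a file (say, to sniff its type) can stop iterating after the first chunk, which is the partial-read case.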

Also, I'm not sure how a web app can be expected to know the encoding
of a text file on disk.

The same way that any other app does --- guess based on the extension and expected usage? --- now that we've all standardized on meta-data-less file systems :-(. I suppose an app could examine the first chunk of the file and then re-read the file with a better guess.

The OS and the UA can often make a better guess, so I think the option to let the UA decide the encoding should at least be provided. Here are some sources of info that the UA has but the web app doesn't (at least without doing a separate binary read of the file first and possibly significant computation):

1) OS-level metadata, as for example in Mac OS X:
$ xattr -l plan.txt
com.apple.TextEncoding: UTF-8;134217984

2) Checking for a BOM.

3) Heuristics for specific file types, like looking for <meta charset> in HTML files or the encoding pseudo-attribute in an XML declaration.

4) General character set autodetection algorithms through statistical methods or similar.

5) Knowledge of the user's locale (useful for some legacy systems where default text encoding is determined by locale).

6) Knowledge of platform encoding conventions.
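
As a rough illustration of how cheap sources 2, 3 and 5 are for code that already has the first bytes of the file, here is a sketch in Python. The function name guess_encoding and the regular expressions are my own simplifications, not any UA's actual algorithm; the OS metadata and statistical detection from items 1 and 4 are left out.

```python
import codecs
import locale
import re
from typing import Optional

# Check UTF-32 before UTF-16: the UTF-32 LE BOM (FF FE 00 00) begins with
# the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Simplified patterns for the XML declaration and an HTML <meta> charset.
_XML_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE
)


def guess_encoding(head: bytes, fallback: Optional[str] = None) -> str:
    """Guess a text encoding from the first bytes of a file.

    Order: BOM, then in-document declarations, then the caller's fallback,
    then the locale's preferred encoding.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    match = _XML_DECL.search(head) or _META_CHARSET.search(head)
    if match:
        return match.group(1).decode("ascii").lower()
    return fallback or locale.getpreferredencoding(False)
```

For example, guess_encoding(b"\xef\xbb\xbfhello") returns "utf-8", and a head with no BOM or declaration falls through to the locale default.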

Regards,
Maciej
