Re: [whatwg] Archive API - proposal
On Wed, Aug 15, 2012 at 9:38 PM, Glenn Maynard gl...@zewt.org wrote: On Wed, Aug 15, 2012 at 10:10 PM, Jonas Sicking jo...@sicking.cc wrote: Though I still think that we should support reading out specific files using a filename as a key. I think a common use-case for ArchiveReader is going to be web developers wanting to download a set of resources from their own website and wanting to use a .zip file as a way to get compression and packaging. In that case they can easily either ensure to stick with ASCII filenames, or encode the names in UTF8. That's what this was for: // For convenience, add getter File? (DOMString name) to FileList, to find a file by name. This is equivalent // to iterating through files[] and comparing .name. If no match is found, return null. This could be a function // instead of a getter. var example_file2 = zipFile.files["file.txt"]; if(example_file2 == null) { console.error("file.txt not found in ZIP"); return; } I suppose a named getter isn't a great idea--you might have a filename like "length" that collides with the list's own properties--so a zipFile.files.find('file.txt') function is probably better. I definitely wouldn't want to use a getter. That runs into all sorts of problems and the syntactical wins are pretty small. One way we could support this would be to have a method which allows getting a list of meta-data about each entry. Probably together with the File object itself. So we could return an array of objects like: [ { rawName: UInt8Array, file: File object, crc32: UInt8Array }, { rawName: UInt8Array, file: File object, crc32: UInt8Array }, ... ] That way we can also leave out the crc from archive types that don't support it. This means exposing two objects per file. I'd prefer a single File-subclass object per file, with any extra metadata put on the subclass. First of all, we'd be talking about 5 vs. 6 objects per file entry: two ArrayBuffers, two ArrayBufferViews, one File and potentially one JS-object. Actually, in Gecko it's more like 8 vs. 9 objects once you start counting the C++ objects and their JS-wrappers. Second, at least in the Gecko engine, allocating the first 5 objects takes about three orders of magnitude more time than allocating the JS-object. I'm also not a fan of sticking the crc32 on the File object itself since we don't actually know that that's the correct crc32 value. But I like this approach a lot if we can make it work. The main thing I'd be worried about, apart from the IO performance above, is if we can make it work for a larger set of archive formats. Like, can we make it work for .tar and .tar.gz? I think we couldn't but we would need to verify. It wouldn't handle it very well, but the original API wouldn't, either. In both cases, the only way to find filenames in a TAR--whether it's to search for one or to construct a list--is to scan through the whole file (and decompress it all, for .tgz). Simply retrieving a list of filenames from a large .tgz would thrash the user's disk and chew CPU. I don't think there's much use in supporting .tar, anyway. Even if you want true streaming (which would be a different API anyway, since we're reading from a Blob here), ZIP can do that too, by using the local file headers instead of the central directory. The main argument that I could see is that the initial proposal allowed extracting files from a .tar.gz while only extracting up to the point of finding the file-to-be-extracted. As long as .getFileNames wasn't called. Which I'll grant isn't a huge benefit. / Jonas
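(Editor's note: the zipFile.files.find("name") convenience being debated above can be sketched over a plain array of File-like objects. The entries and the findByName helper below are illustrative only, not part of any proposal.)

```javascript
// Hypothetical sketch of the find-by-name convenience discussed above:
// scan a FileList-like array and compare each entry's .name, returning
// null when no match is found (mirroring the proposed getter semantics).
function findByName(files, name) {
  for (const f of files) {
    if (f.name === name) return f; // same comparison as iterating files[] by .name
  }
  return null; // no match: return null rather than throwing
}
```

A method like this avoids the named-getter collision problem (a file literally named "length" cannot shadow a method the way it could shadow an array property).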
Re: [whatwg] Archive API - proposal
On Thu, Aug 16, 2012 at 1:22 AM, Jonas Sicking jo...@sicking.cc wrote: First of all, we'd be talking about 5 vs. 6 objects per file entry: two ArrayBuffers, two ArrayBufferViews, one File and potentially one JS-object. Actually, in Gecko it's more like 8 vs. 9 objects once you start counting the C++ objects and their JS-wrappers. That's not what I meant. It looked like you meant passing two arrays to onsuccess, one with metadata and one with Files, so the user would have to reassociate them. Rereading I see that's not what you meant. That said, these can be methods, so the ArrayBuffers aren't allocated unless the user wants them, which I expect would be rare: interface ZipFile : File { ArrayBuffer getErrorVerificationCode(); readonly attribute DOMString errorVerificationMethod; // always CRC32 for now ArrayBuffer getRawFilename(); }; (If all we care about is CRC32, then readonly attribute unsigned long expectedCRC32 instead and drop errorVerificationMethod. I'm assuming non-CRC32 is what you had in mind by making CRC32 an ArrayBuffer instead of just an unsigned long.) I'm also not a fan of sticking the crc32 on the File object itself since we don't actually know that that's the correct crc32 value. It's the expected CRC32, not the CRC32, and should have an attribute name to that effect. It definitely doesn't belong on File itself, since it's pretty tightly specific to archive error checking; it should use a subclass. -- Glenn Maynard
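(Editor's note: for reference, the checksum ZIP stores per central-directory entry is standard CRC-32 with the IEEE reflected polynomial. A minimal bitwise implementation, illustrative only and far slower than the table-driven versions real unzippers use:)

```javascript
// Minimal bitwise CRC-32 (reflected polynomial 0xEDB88320), the
// checksum ZIP records for each entry. Illustrative sketch only;
// production code would precompute a 256-entry lookup table.
function crc32(bytes) {
  let crc = 0xFFFFFFFF;
  for (const b of bytes) {
    crc ^= b;
    for (let i = 0; i < 8; i++) {
      // -(crc & 1) is all-ones when the low bit is set, 0 otherwise,
      // so this conditionally XORs in the polynomial after the shift.
      crc = (crc >>> 1) ^ (0xEDB88320 & -(crc & 1));
    }
  }
  return (crc ^ 0xFFFFFFFF) >>> 0; // >>> 0 keeps the result an unsigned 32-bit value
}
```

The well-known check value for the ASCII string "123456789" is 0xCBF43926, which makes a handy sanity test for any implementation.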
Re: [whatwg] Archive API - proposal
On Tue, Aug 14, 2012 at 11:20 PM, Glenn Maynard gl...@zewt.org wrote: On Tue, Jul 17, 2012 at 9:23 PM, Andrea Marchesini b...@mozilla.com wrote: // The getFilenames handler receives a list of DOMString: var handle = this.reader.getFile(this.result[i]); This interface is problematic. Since ZIP files don't have a standard encoding, filenames in ZIPs are often garbage. This API requires that filenames round-trip uniquely, or else files aren't accessible at all. Indeed, in the case of zip files, file names themselves are dangerous as handles that get passed back and forth, so it seems like a good idea to be able to extract the contents of a file inside the archive without having to address the file by name. As for the filenames, after an off-list discussion, I think the best solution is that UTF-8 is tried first but the ArchiveReader constructor takes an optional second argument that names a character encoding from the Encoding Standard. This will be known as the fallback encoding. If no fallback encoding is provided by the caller of the constructor, Windows-1252 is set as the fallback encoding. When the ArchiveReader processes a filename from the zip archive, it first tests if the byte string is a valid UTF-8 string. If it is, the byte string is interpreted as UTF-8 when converting to UTF-16. If the filename is not a valid UTF-8 string, it is decoded into UTF-16 using the fallback encoding. -- Henri Sivonen hsivo...@iki.fi http://hsivonen.iki.fi/
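(Editor's note: Henri's try-UTF-8-then-fallback rule can be sketched with the Encoding Standard's TextDecoder, an API that postdates this thread. The encoding labels are the spec's; the function name and everything else here is illustrative.)

```javascript
// Sketch of the proposed filename rule: decode as UTF-8 if the bytes
// are valid UTF-8, otherwise fall back to a caller-supplied legacy
// encoding, defaulting to windows-1252 as Henri suggests.
function decodeFilename(bytes, fallback = "windows-1252") {
  try {
    // fatal: true makes the decoder throw on invalid UTF-8 instead of
    // silently emitting U+FFFD replacement characters.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    return new TextDecoder(fallback).decode(bytes);
  }
}
```

As Glenn notes downthread, the weakness of this heuristic is that some legacy-encoded byte strings happen to also be valid UTF-8 and will be misinterpreted.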
Re: [whatwg] Archive API - proposal
On Wed, Aug 15, 2012 at 7:24 AM, Andrea Marchesini b...@mozilla.com wrote: Thanks for your feedback. When I was implementing the ArchiveAPI, my idea was to have a generic Archive API and not just a ZIP API. Of course the current implementation supports just ZIP but in the future we could have support for more formats. What other sorts of archive formats were you thinking of supporting? Apart from archive-specific features like the CRC32, different formats can be read in different ways. A tarball, for instance, can't be read out-of-order easily. It's literally the files concatenated together with a header before each. The headers tell you the size of each file, so you can seek over the data, but you still have to jump across the entire file sequentially to find a particular file. (Though I suppose you could build a table once when the file's loaded.) A gzipped tarball is even worse since the entire stream is compressed, so you have to decompress it to hop around. Do you know how this compares to a JavaScript library implementation with typed arrays and whatnot? David
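(Editor's note: the "build a table once when the file's loaded" idea for TAR follows directly from the ustar header layout: name at offset 0, octal size at offset 124, everything in 512-byte blocks. The sketch below is an illustrative in-memory indexer, not a proposed API; it assumes an already-decompressed archive and ignores checksums and extension headers.)

```javascript
// Illustrative one-pass TAR indexer: scan the 512-byte headers once,
// recording each member's name, data offset, and size, so later
// lookups can seek directly instead of rescanning the whole file.
function indexTar(bytes) {
  const entries = [];
  // Decode a NUL-terminated ASCII field from the header.
  const field = (a, b) =>
    String.fromCharCode(...bytes.subarray(a, b)).replace(/\0[\s\S]*$/, "");
  let off = 0;
  while (off + 512 <= bytes.length && bytes[off] !== 0) { // a zero block ends the archive
    const name = field(off, off + 100);                   // name field: 100 bytes
    const size = parseInt(field(off + 124, off + 136), 8); // size field: 12 bytes, octal
    entries.push({ name, offset: off + 512, size });
    off += 512 + Math.ceil(size / 512) * 512;             // data is padded to 512-byte blocks
  }
  return entries;
}
```

This is the one-time cost Andrea's reply alludes to: a single sequential pass, after which lookups by name are cheap.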
Re: [whatwg] Archive API - proposal
On Wed, Aug 15, 2012 at 6:14 AM, Henri Sivonen hsivo...@iki.fi wrote: As for the filenames, after an off-list discussion, I think the best solution is that UTF-8 is tried first but the ArchiveReader constructor takes an optional second argument that names a character encoding from the Encoding Standard. This will be known as the fallback encoding. If no fallback encoding is provided by the caller of the constructor, Windows-1252 is set as the fallback encoding. When the ArchiveReader processes a filename from the zip archive, it first tests if the byte string is a valid UTF-8 string. If it is, the byte string is interpreted as UTF-8 when converting to UTF-16. If the filename is not a valid UTF-8 string, it is decoded into UTF-16 using the fallback encoding. This would misinterpret filenames as UTF-8. For example, 黴雨.jpg in a CP932 (SJIS) ZIP is also legal UTF-8. This would happen even though the user explicitly specified an encoding, and even though UTF-8 is exceptionally rare in ZIPs (all Windows ZIP software outputs filenames in the user's ACP, and many don't support UTF-8 at all). On Wed, Aug 15, 2012 at 6:17 AM, Andrea Marchesini amarches...@mozilla.comwrote: I agree. I was thinking that the default encoding for filenames is: UTF-8. If the filename is not a valid UTF-8 string, we can use the caller-supplied encoding: I hate to argue against defaulting to UTF-8, but very few ZIPs are actually UTF-8. CP1252 as a default will at least often be correct, but UTF-8 will almost never be. (The only straightforward way I know to create a ZIP with UTF-8 filenames is with a *nix commandline client, and most Windows software won't understand it.) var reader = new ArchiveReader(blob, "Windows-1252"); If this fails, this filename/file will be excluded from the results. There's no need. Decode with proper error handling, as specified in the Encoding spec: http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html. 
This will give placeholder characters (U+FFFD); even if the whole filename comes out unreadable, the file can still be read, selected from a list, shown in a thumbnail view, and so on. Lots of uses aren't dependent on filenames. It should be possible to get the CRC32 of files, which ZIP stores in the central directory. This both allows the user to perform checksum verification himself if wanted, and all the other variously useful things about being able to get a file's checksum without having to read the whole file. can we have a 'generic' archive API supporting CRC32? Do you actually have any concrete plans for other archive formats? The only others commonly used are TAR and RAR. TAR is unsuitable for non-archive use (you have to scan the whole file to construct a file list), and RAR is proprietary. You could design a checksum API that uses the algorithm for a particular format, but that's severe overdesign if it never supports anything but ZIP. I wouldn't worry about this. -- Glenn Maynard
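(Editor's note: the "decode with proper error handling" behavior Glenn points to is what the Encoding Standard's default, non-fatal decode mode does. A short sketch with TextDecoder, which postdates this thread:)

```javascript
// Sketch: non-fatal decoding per the Encoding spec replaces invalid
// byte sequences with U+FFFD instead of failing, so even a garbled
// filename still yields a usable string and the file stays addressable.
const lossy = new TextDecoder("utf-8"); // non-fatal by default
const name = lossy.decode(new Uint8Array([0x61, 0xFF, 0x62])); // 0xFF is never valid UTF-8
// name is "a\uFFFDb": one character was unrecoverable, but the file
// can still be listed, selected, and read.
```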
Re: [whatwg] Archive API - proposal
On Tue, Aug 14, 2012 at 1:20 PM, Glenn Maynard gl...@zewt.org wrote: (I've reordered my responses to give a more logical progression.) On Tue, Jul 17, 2012 at 9:23 PM, Andrea Marchesini b...@mozilla.com wrote: // The getFilenames handler receives a list of DOMString: var handle = this.reader.getFile(this.result[i]); This interface is problematic. Since ZIP files don't have a standard encoding, filenames in ZIPs are often garbage. This API requires that filenames round-trip uniquely, or else files aren't accessible at all. For example, if you have two filenames in CP932, 日 and 本, but the encoding isn't determined correctly, you may end up with two files both with a filename of ??. Either you can't open either file, or you can only open one of them. This isn't theoretical; I hit ZIP files like this in the wild regularly. Instead, I'd recommend that the primary API simply returns File objects directly from the ZIP. For example: var reader = archive.getFiles(); reader.onsuccess = function(result) { // result = [File, File, File, File...]; console.log(result[0].name); // read the file var fr = new FileReader(); fr.readAsText(result[0]); } This allows opening files without any dependency on the filename. Since File objects are by design lightweight--no decompression should happen until you actually read from the file--this isn't expensive and won't perform any extra I/O. All the information you need to expose a File object is in the central directory (filename, mtime, decompressed size). This is a good idea. It neatly solves the problem of not having to rely on filenames as keys. Though I still think that we should support reading out specific files using a filename as a key. I think a common use-case for ArchiveReader is going to be web developers wanting to download a set of resources from their own website and wanting to use a .zip file as a way to get compression and packaging. In that case they can easily either ensure to stick with ASCII filenames, or encode the names in UTF8. 
By allowing them to download a .zip file, they can also store that .zip in compressed form in IndexedDB or the FileSystem API in order to use less space on the user's device. (Additionally many times IO gets faster by using .zip files because the time saved in doing less IO is larger than the time spent decompressing. Obviously very dependent on what data is being stored). . Do you think it can be useful? . Do you see any limitation, any feature missing? It should be possible to get the CRC32 of files, which ZIP stores in the central directory. This both allows the user to perform checksum verification himself if wanted, and all the other variously useful things about being able to get a file's checksum without having to read the whole file. One way we could support this would be to have a method which allows getting a list of meta-data about each entry. Probably together with the File object itself. So we could return an array of objects like: [ { rawName: UInt8Array, file: File object, crc32: UInt8Array }, { rawName: UInt8Array, file: File object, crc32: UInt8Array }, ... ] That way we can also leave out the crc from archive types that don't support it. Though I'm not convinced that CRCs are important enough that we need to put them in the first iteration of the API. (I don't think CRC32 checks should be performed automatically, since it's too hard for that to make sense when random access is involved.) I agree with this. // The ArchiveReader object works with Blob objects: var archiveReader = new ArchiveReader(file); // Any request is asynchronous: The only operation that needs to be asynchronous is creating the ArchiveReader itself. It should parse the ZIP central record before returning a result. Once you've done that you can do the rest synchronously, because no further I/O is necessary until you actually read data from a file. This is definitely an interesting idea. 
The current API is designed around doing the IO when each individual operation is done. You are proposing to do all IO up front which allows all operations to be synchronous. I suspect that doing the IO lazily can provide better performance for some types of operations, such as only wanting to extract a single resource from an archive. But maybe the difference wouldn't be that big in most cases. But I like this approach a lot if we can make it work. The main thing I'd be worried about, apart from the IO performance above, is if we can make it work for a larger set of archive formats. Like, can we make it work for .tar and .tar.gz? I think we couldn't but we would need to verify. / Jonas
Re: [whatwg] Archive API - proposal
On Wed, Aug 15, 2012 at 10:10 PM, Jonas Sicking jo...@sicking.cc wrote: Though I still think that we should support reading out specific files using a filename as a key. I think a common use-case for ArchiveReader is going to be web developers wanting to download a set of resources from their own website and wanting to use a .zip file as a way to get compression and packaging. In that case they can easily either ensure to stick with ASCII filenames, or encode the names in UTF8. That's what this was for: // For convenience, add getter File? (DOMString name) to FileList, to find a file by name. This is equivalent // to iterating through files[] and comparing .name. If no match is found, return null. This could be a function // instead of a getter. var example_file2 = zipFile.files["file.txt"]; if(example_file2 == null) { console.error("file.txt not found in ZIP"); return; } I suppose a named getter isn't a great idea--you might have a filename like "length" that collides with the list's own properties--so a zipFile.files.find('file.txt') function is probably better. By allowing them to download a .zip file, they can also store that .zip in compressed form in IndexedDB or the FileSystem API in order to use less space on the user's device. (Additionally many times IO gets faster by using .zip files because the time saved in doing less IO is larger than the time spent decompressing. Obviously very dependent on what data is being stored). There's also the question of when decompression happens--you don't want to decompress the whole thing in advance if you can avoid it, since if the user isn't doing random access you can stream the decompression--but that's just QoI, of course. One way we could support this would be to have a method which allows getting a list of meta-data about each entry. Probably together with the File object itself. So we could return an array of objects like: [ { rawName: UInt8Array, file: File object, crc32: UInt8Array }, { rawName: UInt8Array, file: File object, crc32: UInt8Array }, ... ] That way we can also leave out the crc from archive types that don't support it. This means exposing two objects per file. I'd prefer a single File-subclass object per file, with any extra metadata put on the subclass. This is definitely an interesting idea. The current API is designed around doing the IO when each individual operation is done. You are proposing to do all IO up front which allows all operations to be synchronous. I suspect that doing the IO lazily can provide better performance for some types of operations, such as only wanting to extract a single resource from an archive. But maybe the difference wouldn't be that big in most cases. I'd expect the I/O savings to be negligible, since ZIP has a central directory at the end, allowing the whole thing to be read very quickly. I hope creating an array of File objects (even thousands of them) isn't too expensive. Even if it is, though, this could be refactored to still give a synchronous interface: store the file directory natively (in a non-File, non-GC'd way), and allow looking up and iterating that list in a way that only instantiates one File object at a time. (This would lose the FileList API compatibility with <input type="file">, though, which I think is a nice plus.) But I like this approach a lot if we can make it work. The main thing I'd be worried about, apart from the IO performance above, is if we can make it work for a larger set of archive formats. Like, can we make it work for .tar and .tar.gz? I think we couldn't but we would need to verify. It wouldn't handle it very well, but the original API wouldn't, either. In both cases, the only way to find filenames in a TAR--whether it's to search for one or to construct a list--is to scan through the whole file (and decompress it all, for .tgz). Simply retrieving a list of filenames from a large .tgz would thrash the user's disk and chew CPU. I don't think there's much use in supporting .tar, anyway. 
Even if you want true streaming (which would be a different API anyway, since we're reading from a Blob here), ZIP can do that too, by using the local file headers instead of the central directory. -- Glenn Maynard
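(Editor's note: for context on why opening a ZIP needs so little up-front I/O: the central directory is located via the End of Central Directory record, found by scanning backwards from the end of the file for its signature. An illustrative sketch:)

```javascript
// Illustrative sketch: find the ZIP End of Central Directory (EOCD)
// record by scanning backwards for its signature, stored little-endian
// as the bytes 50 4B 05 06 ("PK\x05\x06"). Real readers bound the scan
// by the maximum trailing-comment length (65535 + 22 bytes).
function findEOCD(bytes) {
  for (let i = bytes.length - 22; i >= 0; i--) { // the EOCD record is at least 22 bytes
    if (bytes[i] === 0x50 && bytes[i + 1] === 0x4b &&
        bytes[i + 2] === 0x05 && bytes[i + 3] === 0x06) {
      return i; // offset of the EOCD record
    }
  }
  return -1; // no EOCD: not a (complete) ZIP
}
```

From the EOCD a reader learns the offset and size of the central directory, so building the full file list touches only the tail of the archive.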
Re: [whatwg] Archive API - proposal
On Tue, Jul 17, 2012 at 7:23 PM, Andrea Marchesini b...@mozilla.com wrote: Hi All, I would like to propose a new JavaScript/web API that provides the ability to read the content of an archive file through DOMFile objects. I have started to work on this API because it has been requested during some Mozilla Game Meeting by game developers who often use ZIP files as a storage system. What I'm describing is a read-only and asynchronous API built on top of FileAPI ( http://dev.w3.org/2006/webapi/FileAPI/ ). Here's a draft written in WebIDL: interface ArchiveRequest : DOMRequest { // this is the ArchiveReader: readonly attribute nsIDOMArchiveReader reader; }; [Constructor(Blob blob)] interface ArchiveReader { // any method is supposed to be asynchronous // The ArchiveRequest.result is an array of strings (the filenames) ArchiveRequest getFilenames(); // The ArchiveRequest.result is a DOMFile (http://dev.w3.org/2006/webapi/FileAPI/#dfn-file) ArchiveRequest getFile(DOMString filename); }; Here's an example of how to use it: function startRead() { // Starting from an <input type="file" id="file">: var file = document.getElementById('file').files[0]; if (file.type != 'application/zip') { alert("This archive format is not supported"); return; } // The ArchiveReader object works with Blob objects: var archiveReader = new ArchiveReader(file); // Any request is asynchronous: var handler = archiveReader.getFilenames(); handler.onsuccess = getFilenamesSuccess; handler.onerror = errorHandler; // Multiple requests can run at the same time: var handler2 = archiveReader.getFile("levels/1.txt"); handler2.onsuccess = getFileSuccess; handler2.onerror = errorHandler; } // The getFilenames handler receives a list of DOMString: function getFilenamesSuccess() { for (var i = 0; i < this.result.length; ++i) { /* this.reader is the ArchiveReader: var handle = this.reader.getFile(this.result[i]); handle.onsuccess = ... */ } } // The getFile handler receives a File/Blob object (and it can be used with FileReader): function getFileSuccess() { var reader = new FileReader(); reader.readAsText(this.result); reader.onload = function(event) { // alert(event.target.result); } } function errorHandler() { // ... } I would like to receive feedback about this. In particular: . Do you think it can be useful? . Do you see any limitation, any feature missing? FWIW, this API is now available in Firefox nightly builds. It's currently on track to ship in Firefox 17. Feedback would still be greatly appreciated! / Jonas
Re: [whatwg] Archive API - proposal
(I've reordered my responses to give a more logical progression.) On Tue, Jul 17, 2012 at 9:23 PM, Andrea Marchesini b...@mozilla.com wrote: // The getFilenames handler receives a list of DOMString: var handle = this.reader.getFile(this.result[i]); This interface is problematic. Since ZIP files don't have a standard encoding, filenames in ZIPs are often garbage. This API requires that filenames round-trip uniquely, or else files aren't accessible at all. For example, if you have two filenames in CP932, 日 and 本, but the encoding isn't determined correctly, you may end up with two files both with a filename of ??. Either you can't open either file, or you can only open one of them. This isn't theoretical; I hit ZIP files like this in the wild regularly. Instead, I'd recommend that the primary API simply returns File objects directly from the ZIP. For example: var reader = archive.getFiles(); reader.onsuccess = function(result) { // result = [File, File, File, File...]; console.log(result[0].name); // read the file var fr = new FileReader(); fr.readAsText(result[0]); } This allows opening files without any dependency on the filename. Since File objects are by design lightweight--no decompression should happen until you actually read from the file--this isn't expensive and won't perform any extra I/O. All the information you need to expose a File object is in the central directory (filename, mtime, decompressed size). I would like to receive feedback about this. In particular: . Do you think it can be useful? . Do you see any limitation, any feature missing? It should be possible to get the CRC32 of files, which ZIP stores in the central directory. This both allows the user to perform checksum verification himself if wanted, and all the other variously useful things about being able to get a file's checksum without having to read the whole file. (I don't think CRC32 checks should be performed automatically, since it's too hard for that to make sense when random access is involved.) 
// The ArchiveReader object works with Blob objects: var archiveReader = new ArchiveReader(file); // Any request is asynchronous: The only operation that needs to be asynchronous is creating the ArchiveReader itself. It should parse the ZIP central record before returning a result. Once you've done that you can do the rest synchronously, because no further I/O is necessary until you actually read data from a file. This gives the following, simpler interface: var opener = new ZipOpener(file); opener.onerror = function() { console.error("Loading failed"); } opener.onsuccess = function(zipFile) { // .files is a FileList, representing each file in the archive. if(zipFile.files.length == 0) { console.error("ZIP file is empty"); return; } var example_file = zipFile.files[0]; console.log("The first filename is", example_file.name, "with an expected CRC of", example_file.expectedCRC); // Read from the file: var reader = new FileReader(); reader.readAsText(example_file); // For convenience, add getter File? (DOMString name) to FileList, to find a file by name. This is equivalent // to iterating through files[] and comparing .name. If no match is found, return null. This could be a function // instead of a getter. var example_file2 = zipFile.files["file.txt"]; if(example_file2 == null) { console.error("file.txt not found in ZIP"); return; } } (To fit expectedCRC in there, it would actually need to use a subclass of File, not File itself.) This also eliminates an error condition (no getFile error callback), and since .files looks just like HTMLInputElement.files, it can be used directly with code written for it. For example, if you have a function uploadAllFiles(files), you can pass in both an <input type="file" multiple>'s .files or a zipFile.files, and they'll both work. -- Glenn Maynard
Re: [whatwg] Archive API - proposal
On Aug 14, 2012, at 21:21, Glenn Maynard gl...@zewt.org wrote: (I've reordered my responses to give a more logical progression.) On Tue, Jul 17, 2012 at 9:23 PM, Andrea Marchesini b...@mozilla.com wrote: // The getFilenames handler receives a list of DOMString: var handle = this.reader.getFile(this.result[i]); This interface is problematic. Since ZIP files don't have a standard encoding, filenames in ZIPs are often garbage. This API requires that filenames round-trip uniquely, or else files aren't accessible at all. For example, if you have two filenames in CP932, 日 and 本, but the encoding isn't determined correctly, you may end up with two files both with a filename of ??. Either you can't open either file, or you can only open one of them. This isn't theoretical; I hit ZIP files like this in the wild regularly. Instead, I'd recommend that the primary API simply returns File objects directly from the ZIP. For example: var reader = archive.getFiles(); reader.onsuccess = function(result) { // result = [File, File, File, File...]; console.log(result[0].name); // read the file var fr = new FileReader(); fr.readAsText(result[0]); } This allows opening files without any dependency on the filename. Since File objects are by design lightweight--no decompression should happen until you actually read from the file--this isn't expensive and won't perform any extra I/O. All the information you need to expose a File object is in the central directory (filename, mtime, decompressed size). I would like to receive feedback about this. In particular: . Do you think it can be useful? . Do you see any limitation, any feature missing? It should be possible to get the CRC32 of files, which ZIP stores in the central directory. This both allows the user to perform checksum verification himself if wanted, and all the other variously useful things about being able to get a file's checksum without having to read the whole file. 
(I don't think CRC32 checks should be performed automatically, since it's too hard for that to make sense when random access is involved.) // The ArchiveReader object works with Blob objects: var archiveReader = new ArchiveReader(file); // Any request is asynchronous: The only operation that needs to be asynchronous is creating the ArchiveReader itself. It should parse the ZIP central record before returning a result. Once you've done that you can do the rest synchronously, because no further I/O is necessary until you actually read data from a file. This gives the following, simpler interface: var opener = new ZipOpener(file); opener.onerror = function() { console.error("Loading failed"); } opener.onsuccess = function(zipFile) { // .files is a FileList, representing each file in the archive. if(zipFile.files.length == 0) { console.error("ZIP file is empty"); return; } var example_file = zipFile.files[0]; console.log("The first filename is", example_file.name, "with an expected CRC of", example_file.expectedCRC); // Read from the file: var reader = new FileReader(); reader.readAsText(example_file); // For convenience, add getter File? (DOMString name) to FileList, to find a file by name. This is equivalent // to iterating through files[] and comparing .name. If no match is found, return null. This could be a function // instead of a getter. var example_file2 = zipFile.files["file.txt"]; if(example_file2 == null) { console.error("file.txt not found in ZIP"); return; } } (To fit expectedCRC in there, it would actually need to use a subclass of File, not File itself.) This also eliminates an error condition (no getFile error callback), and since .files looks just like HTMLInputElement.files, it can be used directly with code written for it. For example, if you have a function uploadAllFiles(files), you can pass in both an <input type="file" multiple>'s .files or a zipFile.files, and they'll both work. How are nested directories handled in your counter-proposal? --tobie
[whatwg] Archive API - proposal
Hi All, I would like to propose a new JavaScript/web API that provides the ability to read the content of an archive file through DOMFile objects. I have started to work on this API because it has been requested during some Mozilla Game Meeting by game developers who often use ZIP files as a storage system. What I'm describing is a read-only and asynchronous API built on top of FileAPI ( http://dev.w3.org/2006/webapi/FileAPI/ ). Here's a draft written in WebIDL: interface ArchiveRequest : DOMRequest { // this is the ArchiveReader: readonly attribute nsIDOMArchiveReader reader; }; [Constructor(Blob blob)] interface ArchiveReader { // any method is supposed to be asynchronous // The ArchiveRequest.result is an array of strings (the filenames) ArchiveRequest getFilenames(); // The ArchiveRequest.result is a DOMFile (http://dev.w3.org/2006/webapi/FileAPI/#dfn-file) ArchiveRequest getFile(DOMString filename); }; Here's an example of how to use it: function startRead() { // Starting from an <input type="file" id="file">: var file = document.getElementById('file').files[0]; if (file.type != 'application/zip') { alert("This archive format is not supported"); return; } // The ArchiveReader object works with Blob objects: var archiveReader = new ArchiveReader(file); // Any request is asynchronous: var handler = archiveReader.getFilenames(); handler.onsuccess = getFilenamesSuccess; handler.onerror = errorHandler; // Multiple requests can run at the same time: var handler2 = archiveReader.getFile("levels/1.txt"); handler2.onsuccess = getFileSuccess; handler2.onerror = errorHandler; } // The getFilenames handler receives a list of DOMString: function getFilenamesSuccess() { for (var i = 0; i < this.result.length; ++i) { /* this.reader is the ArchiveReader: var handle = this.reader.getFile(this.result[i]); handle.onsuccess = ... */ } } // The getFile handler receives a File/Blob object (and it can be used with FileReader): function getFileSuccess() { var reader = new FileReader(); reader.readAsText(this.result); reader.onload = function(event) { // alert(event.target.result); } } function errorHandler() { // ... } I would like to receive feedback about this. In particular: . Do you think it can be useful? . Do you see any limitation, any feature missing? Regards, AM
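(Editor's note: one small caveat about the file.type check in the example above: the MIME type a browser reports for a picked file is guessed from its extension, so sniffing the ZIP local-file-header signature is a more robust guard. A hedged sketch; the helper name is made up.)

```javascript
// Illustrative sketch: detect a ZIP by its local file header magic
// "PK\x03\x04" rather than trusting file.type, which browsers derive
// from the filename extension. The function name is hypothetical.
// (An empty ZIP starts with the EOCD signature "PK\x05\x06" instead,
// which this simple check would reject.)
function looksLikeZip(firstBytes) {
  return firstBytes.length >= 4 &&
         firstBytes[0] === 0x50 && firstBytes[1] === 0x4b && // "PK"
         firstBytes[2] === 0x03 && firstBytes[3] === 0x04;
}
```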