Hello,

I apologize if this topic comes up often, but the current patch
format is very simple and I think we could do better.  (I've been
bitten by the inefficiency of the current approach many times.)

My proposal is to use a format similar to ar (or we could maybe use ar
as is).

Currently, binary files are stored ASCII-armored inside the patch
file, which is then gzip'd.  This can be extremely slow to process if
the binary file is large.  And it's not only a problem when the file
needs to be extracted from the patch, but also when darcs needs
information about entries in the patch file that come after the large
entries.  My proposed way of dealing with this is to store the length
of each entry, so that file system calls such as lseek can be used to
jump to the next entry in essentially constant time (all the OS needs
to do is change an offset associated with the file descriptor and then
issue a read).
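
To make the seeking idea concrete, here is a rough Python sketch of
walking an ar-style archive by reading only the fixed 60-byte headers
and seeking past the data.  The field offsets follow the common ar
layout; list_entries and make_entry are names I made up for this
example, and the names are stored BSD-style (no trailing slash), which
is an assumption.

```python
import io

AR_MAGIC = b"!<arch>\n"   # global header of the common ar format
HEADER_LEN = 60           # each entry header is a fixed 60 bytes


def make_entry(name, data):
    """Hypothetical helper: build one ar entry (header + padded data)."""
    header = "{:<16}{:<12}{:<6}{:<6}{:<8}{:<10}`\n".format(
        name, 0, 0, 0, "100644", len(data))
    pad = b"\n" if len(data) % 2 else b""   # data is padded to an even length
    return header.encode("ascii") + data + pad


def list_entries(f):
    """Return (name, size, data_offset) per entry without reading the data."""
    if f.read(len(AR_MAGIC)) != AR_MAGIC:
        raise ValueError("not an ar archive")
    entries = []
    while True:
        header = f.read(HEADER_LEN)
        if len(header) < HEADER_LEN:
            break
        name = header[0:16].decode("ascii").rstrip()
        size = int(header[48:58].decode("ascii"))
        entries.append((name, size, f.tell()))
        # Jump over the data (and its padding byte) instead of reading it.
        f.seek(size + size % 2, io.SEEK_CUR)
    return entries
```

The cost of finding a later entry then depends on the number of
entries before it, not on how large they are.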

Another feature I hear people asking about from time to time is
storing of permission information (or more generally meta-data).  This
could also be easy to do using the ar format which has fields for
storing a small bit of meta-data.
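
As an illustration of the meta-data that is already there, the sketch
below packs and unpacks the header fields of the common ar layout
(name, mtime, uid, gid, mode, size).  build_header and parse_header are
hypothetical names for this example only, and nothing here is taken
from darcs.

```python
import stat


def build_header(name, size, mode, mtime=0, uid=0, gid=0):
    """Pack the ar header fields into the fixed 60-byte ASCII layout."""
    fields = "{:<16}{:<12}{:<6}{:<6}{:<8o}{:<10}`\n".format(
        name, mtime, uid, gid, mode, size)
    return fields.encode("ascii")


def parse_header(header):
    """Unpack a 60-byte ar header into a dict of its meta-data fields."""
    if len(header) != 60 or header[58:60] != b"`\n":
        raise ValueError("bad ar entry header")
    text = header.decode("ascii")
    return {
        "name": text[0:16].rstrip(),
        "mtime": int(text[16:28]),
        "uid": int(text[28:34]),
        "gid": int(text[34:40]),
        "mode": int(text[40:48], 8),   # mode is stored in octal
        "size": int(text[48:58]),
    }
```

For example, stat.filemode(parse_header(h)["mode"]) would give back a
string such as "-rwxr-xr-x" for a header h built with mode 0o100755.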

The easiest way of taking advantage of this new format is as follows:

1) The current patch format is kept the way it is for textual
   patches.  Then this gzip'd file is put in the archive, with the
   appropriate header.  This adds one ar header per entry, so the
   overhead should be very small.

2) Binary files can be stored as-is or compressed (usually compressing
   binary files doesn't help much, so why bother?) inside the ar file,
   taking full advantage of the ar format.  This allows easy copying
   to disk, or seeking past the contents to look at other data in the
   archive.
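
A small Python sketch of this first layout, under my own assumptions:
the textual patch goes in as one gzip'd entry (called "patch.gz" here,
a made-up name) and each binary file goes in raw.  The helper names are
hypothetical, and entry names are assumed to fit in the 16-byte name
field.

```python
import gzip
import io

AR_MAGIC = b"!<arch>\n"


def ar_member(name, data):
    """Build one ar entry: 60-byte header, data, padding to even length."""
    header = "{:<16}{:<12}{:<6}{:<6}{:<8}{:<10}`\n".format(
        name, 0, 0, 0, "100644", len(data))
    pad = b"\n" if len(data) % 2 else b""
    return header.encode("ascii") + data + pad


def build_patch_archive(text_patch, binaries):
    """Store the gzip'd textual patch first, then each binary file as-is."""
    out = AR_MAGIC + ar_member("patch.gz", gzip.compress(text_patch))
    for name, data in binaries.items():
        out += ar_member(name, data)
    return out


def read_member(archive, wanted):
    """Return the raw bytes of one entry, seeking past all the others."""
    f = io.BytesIO(archive)
    f.seek(len(AR_MAGIC))
    while True:
        header = f.read(60)
        if len(header) < 60:
            raise KeyError(wanted)
        name = header[0:16].decode("ascii").rstrip()
        size = int(header[48:58].decode("ascii"))
        if name == wanted:
            return f.read(size)
        f.seek(size + size % 2, io.SEEK_CUR)
```

Reading a binary entry is then a plain copy of size bytes, with no
decoding or decompression step, and reading the textual patch never
touches the binary data.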

And probably the better way to use ar:

1) Patches that are related to a given file are gzip'd and stored
   together as an entry in the ar file along with permission data.

2) Binary files are stored as above.

There are other variations, such as storing each hunk as a separate
entry in the patch file, or using a hybrid approach where one entry in
the ar file serves as a table of contents and the rest is laid out
along the lines of the first approach.

And I'd hate to see this idea get shot down on the basis that ar is
not flexible enough or because it is tied to Unix, so let me just say
this: we don't have to implement ar exactly, but I think it provides
a good definition to start from.

As a quick reference I found this website:
http://publibn.boulder.ibm.com/doc_link/en_US/a_doc_lib/files/aixfiles/ar_IA64.htm

Or, if you prefer tiny-urls:
http://tinyurl.com/73wl9

Thanks,
Jason

_______________________________________________
darcs-devel mailing list
[email protected]
http://www.abridgegame.org/cgi-bin/mailman/listinfo/darcs-devel