Re: [darcs-users] RFC: Third-party applications directory

Stephen J. Turnbull Tue, 03 Jan 2006 19:16:43 -0800

>>>>> "Olivier" == Olivier Thauvin <[EMAIL PROTECTED]> writes:

    Olivier> On Tuesday 03 January 2006 19:22, Alberto Bertogli
    Olivier> wrote:
    >> On Tue, Jan 03, 2006 at 11:32:01AM +0000, Pedro Melo wrote:

> > >* encoding
> > >   The encoding used for repository data and metadata (ie. covers
> > >   files AND patch descriptions).

Please rethink this.  All files should be binary data from the point
of view of text encoding.  Consider what happens when you have a
Makefile with file names in ISO Cyrillic, and somebody wants to put
your code on a filesystem where file names are encoded in UTF-8.
Dealing with this *must* be a user-level operation.  (I'm
authoritative<wink> on that; I have ten years' experience dealing with
the fallout from the most comprehensive attempt to deal with it
automatically---Emacs/MULE.)  Core darcs should preserve file data at
all costs---this allows users to recover from mistakes in transcoding.
It's reasonable to have an advisory property to help downstream users
of the data cope with those issues, but Darcs itself should strictly
ignore this flag.
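
A minimal sketch of that division of labor, in Python purely for
illustration.  StoredFile, record, checkout and view_as_text are
hypothetical names, not darcs APIs.  The assumption is only that the
core stores bytes untouched and carries the encoding as an advisory
label that it never applies itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StoredFile:
    data: bytes                              # exact bytes as recorded
    advisory_encoding: Optional[str] = None  # hint for downstream tools only


def record(data: bytes, advisory_encoding: Optional[str] = None) -> StoredFile:
    # The core never decodes or transcodes: bytes in, same bytes out.
    return StoredFile(bytes(data), advisory_encoding)


def checkout(stored: StoredFile) -> bytes:
    # The advisory flag is deliberately ignored here.
    return stored.data


def view_as_text(stored: StoredFile) -> str:
    # User-level operation, outside the core.  A wrong label is recoverable
    # because the stored bytes were never altered.
    return stored.data.decode(stored.advisory_encoding or "latin-1")
```

Because checkout returns the recorded bytes unchanged, a mislabelled
file can simply be relabelled and viewed again.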

For metadata, there should be only one encoding used: Unicode.  Five
UTFs would be allowed: 7-bit unibyte aka US-ASCII and ISO-646-REF,
8-bit unibyte aka ISO-8859-1, UTF-8 with signature (BOM), and two
UTF-16s with signature (BOM).  The first two are small subsets of
Unicode, of course, but quite useful for many users.  They _are_
Unicode; Unicode does not prohibit subsetting, it only requires that
the subset be given the semantics provided in the standard.  So any
program that can deal with Unicode can deal with all of these UTFs.

If there's no signature you can use Latin-1 if you like, or you can
use the standard pseudo-ligatures ("ae", "ss", etc) for compatibility
with deficient USian systems.  That should cover almost all the users
who are worried about this.  There is no ambiguity, and by providing a
trivial BOM detector in darcs you can at least have a standard warning
"encoded in UTF-16-BE, which your system cannot display".

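A rough sketch of such a BOM detector, in Python for illustration.
The function names are hypothetical, and the fallback rule (pure 7-bit
data is ASCII, anything else without a signature is Latin-1) is my
reading of the five allowed forms above, not existing darcs behavior.

```python
import codecs


def sniff_metadata_encoding(raw: bytes) -> str:
    """Classify metadata bytes as one of the five allowed forms."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if raw.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    # No signature: 7-bit data is US-ASCII, anything else is ISO-8859-1.
    if all(byte < 0x80 for byte in raw):
        return "ascii"
    return "latin-1"


def decode_metadata(raw: bytes) -> str:
    label = sniff_metadata_encoding(raw)
    # Python's "utf-16" codec consumes the BOM and takes the byte order
    # from it; "utf-8-sig" likewise strips the UTF-8 signature.
    codec = "utf-16" if label.startswith("utf-16") else label
    return raw.decode(codec)
```

The label returned by sniff_metadata_encoding is what a warning such
as the one quoted above would report when the local system cannot
display that form.
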
ISTR a Russian or other Cyrillic user in the group talking about this
stuff.  I'm very sorry, but this really is necessary.  People who want
to talk about money in their logs can use the standard abbreviation
EUR.  Everybody else is covered.

Why is this necessary?  Because people _will_ make different
implementations.  I assure you that if Unicode is not required there
_will_ be a "Japanese patch" that only handles non-Unicode Japanese,
and doesn't grok Windows 12xx, while there will probably be Windows
and KOI8 patches that don't understand EUC-JP or Big5.  These
implementations will be mutually incompatible.

This strategy is fairly inexpensive to implement, and does not prevent
use of national encodings in editors and the like.  There are two free
comprehensive implementations of coding translation, GNU libc's gconv
(the guts of libc's iconv implementation) and GNU recode (more
portable and standalone).  I don't know how good the coverage of *BSD
iconv is, but I suppose by now it's excellent, and at least NetBSD
pkgsrc and DarwinPorts provide recode ports.  The main proprietary
personal OSes (Mac OS X and Windows) handle Unicode fine.  Add a hook
to call an appropriate coding program and a bit of defensive
sanity-checking internally, and you're set to go.
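
For illustration, a sketch of what such a hook plus sanity check might
look like.  It uses Python's built-in codecs as a stand-in for gconv or
recode, the function name is hypothetical, and the choice of UTF-8
with signature as the stored form is just one of the allowed forms
picked for the example.

```python
import codecs


def to_metadata_bytes(raw: bytes, source_encoding: str) -> bytes:
    """Transcode a log message from a national encoding to UTF-8 with BOM."""
    # Strict decoding: malformed input raises UnicodeDecodeError instead of
    # being silently replaced, and an unknown codec name raises LookupError.
    text = raw.decode(source_encoding, errors="strict")
    encoded = codecs.BOM_UTF8 + text.encode("utf-8")
    # Defensive sanity check: the stored form must decode back to the
    # same text before it is accepted.
    if encoded.decode("utf-8-sig") != text:
        raise ValueError("transcoding did not round-trip")
    return encoded
```

An editor working in KOI8-R or EUC-JP would pass its own encoding name
as source_encoding; only the Unicode result would reach the metadata.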

Backward compatibility with legacy encoded repos?  Too bad.  You've
already screwed the rest of the world; convert and get busy making
your next version so good that nobody will ever look at the logs of
the previous one.  :-)  For backward compatibility, patches would have
to get a darcs-version property (I think they actually already have
that? or is it just repos?).  Patches without a version would have
their logs displayed as ASCII if 7-bit, and as ISO-8859-1 + warning to
stderr about possible data loss if 8-bit.
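
The display rule for unversioned patches could be sketched roughly as
follows.  The function name and the warning text are hypothetical;
only the rule itself (ASCII if 7-bit, otherwise ISO-8859-1 plus a
warning on stderr) comes from the paragraph above.

```python
import sys


def display_legacy_log(raw: bytes) -> str:
    """Decode the log of a patch that carries no darcs-version property."""
    if all(byte < 0x80 for byte in raw):
        return raw.decode("ascii")
    # 8-bit data of unknown origin: show it as ISO-8859-1, which never
    # fails to decode, and warn that the result may be wrong.
    print(
        "warning: unversioned patch log shown as ISO-8859-1; "
        "possible data loss",
        file=sys.stderr,
    )
    return raw.decode("latin-1")
```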

There is nothing here that says you can't make wrappers for backward
compatibility with your own stuff, and distribute them to downstream
users of your code.  The point is not to make a standard that requires
other shops and other implementations to read your legacy encodings.
For internal use in Darcs metadata, Darcs should standardize on
Unicode, and it should do so now.

    Olivier> Properties having names beginning with 'darcs:' have special
    Olivier> meaning for darcs (imagine darcs:eol, darcs:encoding,
    Olivier> darcs:executable).

Why not just use XML namespaces and be done with it?  expat and
libxml2 (among others) handle namespaces fine.  Most programming
languages have bindings for at least one of them.  Then you can use
the canonical form {http://darcs.org/repo-property-names}eol, and use
the standard xmlns attribute to abbreviate:

PROPFIND /your-repo HTTP/1.1

<?xml version="1.0" encoding="utf-8"?>
<DAV:propfind xmlns:DAV="DAV:">        <!-- may be unnecessary -->
  <DAV:prop xmlns:darcs="http://darcs.org/repo-property-names">
    <darcs:eol />
    <darcs:executable />
  </DAV:prop>
</DAV:propfind>

As usual with XML, it's hideously verbose, but more or less readable.

N.B.  The URI http://darcs.org/repo-property-names is used only as a
name (in effect, a URN); nothing needs to be served there.  You could
also put the standard there, but that's not necessary.  The
point is that since Darcs owns darcs.org, you can give it a
descriptive name that is guaranteed not to conflict with (eg)
http://subversion.org/repo-property-names.
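
To show the canonical form in practice, here is a small sketch using
Python's standard xml.etree.ElementTree, which happens to report
element names in the same {namespace}local notation.  The function
name is hypothetical, and the request body is the example above.

```python
import xml.etree.ElementTree as ET

# Namespace name taken from the example above.
DARCS_NS = "http://darcs.org/repo-property-names"

PROPFIND_BODY = b"""<?xml version="1.0" encoding="utf-8"?>
<DAV:propfind xmlns:DAV="DAV:">
  <DAV:prop xmlns:darcs="http://darcs.org/repo-property-names">
    <darcs:eol />
    <darcs:executable />
  </DAV:prop>
</DAV:propfind>
"""


def requested_properties(body: bytes) -> list:
    """Return the requested property names in {namespace}local form."""
    root = ET.fromstring(body)
    prop = root.find("{DAV:}prop")
    if prop is None:
        return []
    return [child.tag for child in prop]
```

The prefix chosen in the document ("darcs:" here) never shows up in
the result; only the namespace name does, which is what makes the
names collision-free.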

    Olivier> darcs tools will then have new command: propset, propdel
    Olivier> and propget.

All of this can be done with WebDAV.  Please don't reinvent these
wheels.

I'm mindful of Julius's rant against DeltaV (the versioning parts of
WevDAV).  But this proposal doesn't need DeltaV.  Definitely the
property management portion is usable.

There may be some useful information, specifically URLs for RFCs, at

    http://turnbull.sk.tsukuba.ac.jp/Blogs/Software/WebDAVSupport

There's also a fair amount of ranting about the state of the art in
support libraries, and some discussion of implementation-in-progress
of editor (ie, client-side) support for WebDAV.  You probably want to
take those with a grain of salt.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.

