[Gnu-arch-users] programming in the large (Re: On configs and huge source trees)

Thomas Lord Tue, 18 Oct 2005 12:09:38 -0700

Arch is designed for "programming in the large" -- dealing
with very, very large collections of source.  Ludovic's
comments are an occasion to refresh people's memory about
that.


Ludovic argues that very large trees are questionable practice
and that, although it might not be the revision control system's
job to impose practice, nevertheless some large-tree projects
could be improved by using configs.  Ok, I want to refine those
arguments but first some context:

The ultimate "huge source tree", and the one that was one of the
main inspirations for configs, is a tree of complete system 
source including kernel, system configuration files, all libraries,
and all installed programs.

In such a tree there are lots of components which are separately
maintained and many which are used by more than one tool.  In
some cases, we have to expect a single component to occur in more
than one place in the huge tree.  Realistically, we even have to
expect the huge tree to contain multiple distinct versions of 
some components.

Developers need to create multiple instances of such huge
trees in their workspaces, and need some of those instances to
be coherent subsets.   For example: "give me everything other than
the X11 libraries and libc that I need to build Evolution".
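
To make the "coherent subset" request concrete, here is a minimal Python
sketch. It assumes build dependencies are available as a simple graph; the
graph, the component names, and the function needed_components are all
hypothetical and are not part of Arch or tla.

```python
# Hypothetical build-dependency graph: component -> components it needs.
# Every name and edge here is invented for illustration.
DEPS = {
    "evolution": ["gtk", "libxml", "libc"],
    "gtk": ["glib", "x11-libs", "libc"],
    "glib": ["libc"],
    "libxml": ["libc"],
    "x11-libs": ["libc"],
    "libc": [],
}


def needed_components(target, deps, exclude=()):
    """Return the components reachable from target, omitting excluded ones.

    Excluded components are assumed to be supplied some other way (for
    example by the installed system), so their dependencies are not followed.
    """
    excluded = set(exclude)
    needed = set()
    stack = [target]
    while stack:
        name = stack.pop()
        if name in needed or name in excluded:
            continue
        needed.add(name)
        stack.extend(deps.get(name, []))
    return needed


subset = needed_components("evolution", DEPS, exclude=["x11-libs", "libc"])
print(sorted(subset))
```

The resulting set is the list of components a workspace would have to
populate; how each one maps to a location in the tree is a separate question.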

Administrators, distribution maintainers, and users have a need
to be able to audit these huge trees -- to accurately summarize the
layout and list of included components in terms of global names of
the particular version of each component.  Such a list of components
is a good *definition* for a named release of a (source-based)
distribution.

Configs are largely for those kinds of developer, admin, distribution
maintainer, and user needs.  They facilitate the separate development
of logically separate components and they give us a global name-space
for specific constellations of components.
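
As a rough illustration of a config used as an audit list, the Python sketch
below assumes a config is plain text with one "location, then global version
name" pair per line. That layout, the archive name dist@example.org--2005,
and all component names are assumptions made for the example; check the tla
documentation for the real file format.

```python
from collections import defaultdict

# Hypothetical config: one "path  global-version-name" pair per line.
CONFIG_TEXT = """\
# location              component version (global name)
./kernel                dist@example.org--2005/kernel--stable--2.6
./lib/libc              dist@example.org--2005/libc--stable--2.3
./apps/mail/libxml      dist@example.org--2005/libxml--stable--2.6
./apps/web/libxml       dist@example.org--2005/libxml--stable--2.4
"""


def parse_config(text):
    """Return (path, version_name) pairs, skipping blank and comment lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        path, name = line.split(None, 1)
        entries.append((path, name.strip()))
    return entries


def audit(entries):
    """Map each global version name to the sorted paths where it is placed."""
    placed = defaultdict(list)
    for path, name in entries:
        placed[name].append(path)
    return {name: sorted(paths) for name, paths in placed.items()}


def multi_version_components(entries):
    """Return components present in more than one version.

    The component is taken to be the name minus its trailing "--version".
    """
    versions = defaultdict(set)
    for _path, name in entries:
        component, version = name.rsplit("--", 1)
        versions[component].add(version)
    return {c: sorted(v) for c, v in versions.items() if len(v) > 1}


entries = parse_config(CONFIG_TEXT)
for name, paths in sorted(audit(entries).items()):
    print(name, "->", ", ".join(paths))
print(multi_version_components(entries))
```

The second report is the case mentioned earlier: one tree that legitimately
contains two distinct versions of the same component.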

It helps, too, to make sure that the config system works on top of a
revision control system which is not only distributed but features
peer-to-peer replication, dumb-fs servers, and good cryptographic
integrity checking and authentication.  For example, if we had a
network of dumb boxes being constantly, incrementally flood-filled
with updates to components, the distribution business would be improved.
A distribution publisher could simply sign a config saying "This has 
passed our testing and is dubbed distribution release 10.0."  No
need for a central, closely held network of "update servers" (though
they would still have a limited utility for some customers).  Instead,
most systems could update from the closest public mirror.   For an 
extra layer of paranoia reduction, public crawlers could be built which
compare these mirrors to one another, etc.  Security would be increased.
Emergency distribution updates would be less vulnerable to denial of
service attacks.
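
The mirror-comparing crawler can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: the mirror names and the
function compare_mirrors are hypothetical, and the bytes are passed in
directly rather than fetched. Majority agreement between mirrors is not
authentication; the publisher's signature on the config is what would
establish trust, and a check like this only flags mirrors that have diverged.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """SHA-256 hex digest of one mirror's copy of a file."""
    return hashlib.sha256(data).hexdigest()


def compare_mirrors(copies):
    """Compare copies of the same published file from several mirrors.

    copies maps a (hypothetical) mirror name to the bytes it served.
    Returns the most common digest and the sorted names of the mirrors
    whose copy differs from it.
    """
    digests = {mirror: digest(data) for mirror, data in copies.items()}
    majority, _count = Counter(digests.values()).most_common(1)[0]
    outliers = sorted(m for m, d in digests.items() if d != majority)
    return majority, outliers


copies = {
    "mirror-a": b"release 10.0 config",
    "mirror-b": b"release 10.0 config",
    "mirror-c": b"release 10.0 config (tampered)",
}
majority, outliers = compare_mirrors(copies)
print("disagreeing mirrors:", outliers)
```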

Of course, good revision control is only half of the equation.  We 
would also need well designed conventions for configure/build/install
infrastructure.   A huge tree of all system sources needs, for example,
something like "make world" which, at the root, configures, builds, 
and installs all subtrees.   Subtrees have to be independently
constructable.  Installation conventions have to be flexible enough
so that there is a universal mechanism for installing and running
test versions without interfering with or accidentally using the 
system install.   Of course nearly every package has something
kinda-sorta like this, but there is too little consistency among
packages, and so getting from the upstreams of individual components
to a source-based distribution requires too much work.   Autoconf was
never designed with programming in the large in mind -- it takes the
view that people work with one package at a time.  The
package-framework component that tla uses was never intended to be
more than a sketch of what is actually needed.
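
A "make world" driver of the kind described above might look roughly like
the following Python sketch. It assumes each subtree declares which other
subtrees must be installed first and that every subtree accepts the same
three steps and an install prefix. The subtree names, the step names, and
the functions world_order and make_world are hypothetical; this is not how
package-framework or Autoconf work.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical subtrees: name -> subtrees that must be installed first.
SUBTREES = {
    "libc": [],
    "libxml": ["libc"],
    "mailer": ["libxml", "libc"],
}


def world_order(subtrees):
    """Return the subtrees in an order that respects their dependencies."""
    return list(TopologicalSorter(subtrees).static_order())


def make_world(subtrees, prefix, run=print):
    """Configure, build, and install every subtree under one prefix.

    run receives one (step, subtree, prefix) tuple per action. Passing a
    scratch prefix is how a test install stays clear of the system install.
    """
    for name in world_order(subtrees):
        for step in ("configure", "build", "install"):
            run((step, name, prefix))


make_world(SUBTREES, "/tmp/test-root")
```

Because each subtree is driven through the same three steps, any one of
them can also be built on its own by passing a smaller graph.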

It's politically tricky (not impossible, I think) to get upstreams to
adopt coherent distributed revision control practices and improved
configure/build/install conventions.   Tools designed with those needs
in mind are part of the solution.  Obtaining or simulating a critical
mass of upstream projects that use such tools might do the trick.

Part of the political problem is that the FOSS community has lost
sight of (mistaken for solved) the problem of constructing a "complete
GNU system" (or something morally comparable).  The vendors and Debian
have taught people to think of each upstream project as separate --
to think of the harmonizing of components into a complete system
as "somebody else's problem".  In fact, it would take only minor
shifts in tools and conventional practices to eliminate most of 
the expensive drudge-work of distribution assembly, freeing up those
resources for more appropriate tasks like component testing, review,
and other forms of vetting, not to mention for more aggressive 
programs of forward-thinking R&D.

So, Ludovic:

Yes, factoring into configs or something very much like them is not
only best practice, it's just about the only practice that makes
good sense.

Revision control's proper role there is to provide the config-like
thing.

Performance limitations of revision control are clearly not a good
reason for using configs.  Reports about achievable tla performance
on things like gcc and the kernel are mixed: my understanding is that
some people have obtained performance far better than what was recently
reported here.  People are complaining about `baz status' but my
understanding is that that command has long-known, unfixed
implementation bugs which give rise to that bad performance -- it seems
odd to tar tla with that.

In any event there's the Arch 2.0 direction, gathering dust on a shelf.
No matter how many times Matthieu calls it a "complete rewrite" that
doesn't make it true.  *`revc'* is, indeed, a completely newly coded
storage manager.   It does replace one part of Arch but not the rest.
It can do things like give git-like speed for commits and filename-based
tree comparisons.  It rests for want of resources to port inventory and
merging features from tla.

-t



