On Tuesday, 18 December 2012 at 00:15:04 UTC, H. S. Teoh wrote:
On Tue, Dec 18, 2012 at 02:08:55AM +0400, Dmitry Olshansky wrote:
[...]
I suspect it's one of the prime examples where the UNIX philosophy of combining a bunch of simple (~ dumb) programs together in place of one more complex program was taken *far* beyond reasonable lengths.

Having a pipeline:

    preprocessor -> compiler -> (still?) assembler -> linker

where every program tries hard to know nothing about the previous ones (and to be as simple as it possibly can be) is bound to get inadequate results on many fronts:
- efficiency & scalability
- cross-border error reporting and detection (linker errors? errors for expanded macro magic?)
- cross-file manipulations (e.g. optimization; see _how_ LTO is done in GCC)
- multiple problems from a loss of information across the pipeline*
The problem is not so much the structure preprocessor -> compiler -> assembler -> linker; the problem is that these logical stages have been arbitrarily assigned to individual processes residing in their own address space, communicating via files (or pipes, whatever it may be).

The fact that they are separate processes is in itself not that big of a problem, but the fact that they reside in their own address space is a big problem, because you cannot pass any information down the chain except through rudimentary OS interfaces like files and pipes. Even that wouldn't have been so bad, if it weren't for the fact that the user interface (in the form of text input / object file format) has also been conflated with the program interface (the compiler has to produce the input to the assembler in *text*, and the assembler has to produce object files that do not encode any direct dependency information, because that's the standard file format the linker expects).
Now consider if we keep the same stages, but each stage is not a separate program but a *library*. The code then might look, in greatly simplified form, something like this:

    import libdmd.compiler;
    import libdmd.assembler;
    import libdmd.linker;

    void main(string[] args) {
        // typeof(asmCode) is some arbitrarily complex data
        // structure encoding assembly code, inter-module
        // dependencies, etc.
        auto asmCode = compiler.lex(args)
                               .parse()
                               .optimize()
                               .codegen();

        // Note: no stupid redundant convert to string, parse,
        // convert back to internal representation.
        auto objectCode = assembler.assemble(asmCode);

        // Note: the linker has direct access to dependency info,
        // etc., carried over from asmCode -> objectCode.
        auto executable = linker.link(objectCode);

        auto output = File(outfile, "w");
        executable.generate(output);
    }
Note that the types asmCode, objectCode, and executable are arbitrarily complex, and may contain lazily-evaluated data structures, references to on-disk temporary storage (for large projects you can't hold everything in RAM), etc. Dependency information in asmCode is propagated to objectCode as necessary. The linker has full access to all the info the compiler has access to, and can perform inter-module optimization, etc., by accessing information available to the *compiler* front-end, not just some crippled object file format.
The root of the current nonsense is that perfectly fine data structures are arbitrarily required to be flattened into some kind of intermediate form, written to some file (or sent down some pipe), often with loss of information, then read from the other end, interpreted, and reconstituted into other data structures (with incomplete info), then processed. In many cases, information that didn't make it through the channel has to be reconstructed (often imperfectly) and then used. Most of these steps are redundant. If the compiler data structures were already directly available in the first place, none of this baroque dance would be necessary.
*Semantic info on the interdependency of symbols in a source file is destroyed right before the linker, and thus each .obj file is included as a whole or not at all. Thus all C run-times I've seen _sidestep_ this by writing each function in its own file(!). Even this alone should have been a clear indication.
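That all-or-nothing granularity is real, and modern GNU toolchains work around it rather than fix it: -ffunction-sections puts each function in its own section so the linker can garbage-collect functions individually, which is the same trick as one-function-per-file without the file explosion (a sketch; file and function names are invented):

```shell
cat > util.c <<'EOF'
#include <stdio.h>
void used(void)   { puts("used"); }
void unused(void) { puts("unused"); }
EOF
cat > prog.c <<'EOF'
void used(void);
int main(void) { used(); return 0; }
EOF
# Split each function into its own section, then let the linker
# discard unreferenced sections:
gcc -ffunction-sections -c util.c prog.c
gcc -Wl,--gc-sections prog.o util.o -o prog
./prog                     # prints "used"
nm prog | grep unused      # no output: the dead function was dropped
```

Without -ffunction-sections, pulling in used() would have dragged unused() along, exactly as the footnote complains.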
While simplicity (and correspondingly size in memory) of programs was king in the 70's, that's well past due. Nowadays I think it's all about getting the highest throughput and more powerful features.
[...]
Simplicity is good. Simplicity lets you modularize a very complex piece of software (a compiler that converts D source code into executables) into manageable chunks. Simplicity does not require shoe-horning modules into separate programs with separate address spaces and separate (and deficient) input/output formats.

The problem isn't with simplicity; the problem is with carrying over the archaic mapping of compilation stage -> separate program. I mean, imagine if std.regex were written so that regex compilation ran in a separate program with a separate address space, the regex matcher that executes the match ran in another separate program with another separate address space, and the two talked to each other via pipes, or worse, intermediate files.
I've mentioned a few times before a horrendous C++ project that I had to work with once, where to make a single function call to a particular subsystem, it had to go through 6 layers of abstraction, one of which was IPC through a local UNIX socket, *and* another of which involved fwrite()ing function parameters into a file and fread()ing said parameters from the file in another process, with the 6 layers repeating in reverse to propagate the return value of the function back to the caller.

In the new version of said project, that subsystem exposes a library API where to make a function call, you, um, just call the function (gee, what a concept). Needless to say, it didn't take a lot of effort to convince customers to upgrade, upon which we proceeded with great relish to delete every single source file having to do with that 6-layered monstrosity, and had a celebration afterwards.
From the design POV, though, the layout of the old version of the project utterly made sense. It was superbly (over)engineered, and if you made UML diagrams of it, they would be works of art fit for the British Museum. The implementation, however, was "somewhat" disappointing.

T
IMO, it's not even an issue of the separate address spaces. The core problem is the direct result of relying on *archaic file formats*. Simply serializing the intermediate data structure already solves the data-loss problems, and all that remains are aspects of efficiency, which are much less important given current compilation speeds. Separate address spaces can be useful if we add distributed and concurrent aspects into the mix.