On Tuesday, 18 December 2012 at 00:15:04 UTC, H. S. Teoh wrote:
On Tue, Dec 18, 2012 at 02:08:55AM +0400, Dmitry Olshansky wrote:
[...]
I suspect it's one of the prime examples where the UNIX philosophy of combining a bunch of simple (~ dumb) programs together in place of one more complex program was taken *far* beyond reasonable lengths.

Having a pipeline:

    preprocessor -> compiler -> (still?) assembler -> linker

where every program tries hard to know nothing about the previous ones (and to be as simple as it possibly can be) is bound to get inadequate results on many fronts:
- efficiency & scalability
- cross-border error reporting and detection (linker errors? errors for expanded macro magic?)
- cross-file manipulations (e.g. optimization; see _how_ LTO is done in GCC)
- multiple problems from a loss of information across the pipeline*
The problem is not so much the structure preprocessor -> compiler -> assembler -> linker; the problem is that these logical stages have been arbitrarily assigned to individual processes residing in their own address space, communicating via files (or pipes, whatever it may be).

The fact that they are separate processes is in itself not that big of a problem, but the fact that they reside in their own address space is a big problem, because you cannot pass any information down the chain except through rudimentary OS interfaces like files and pipes. Even that wouldn't have been so bad, if it weren't for the fact that the user interface (in the form of text input / object file format) has also been conflated with the program interface (the compiler has to produce the input to the assembler in *text*, and the assembler has to produce object files that do not encode any direct dependency information, because that's the standard file format the linker expects).
Now consider if we keep the same stages, but each stage is not a separate program but a *library*. The code then might look, in greatly simplified form, something like this:

    import libdmd.compiler;
    import libdmd.assembler;
    import libdmd.linker;

    void main(string[] args) {
        // typeof(asmCode) is some arbitrarily complex data
        // structure encoding assembly code, inter-module
        // dependencies, etc.
        auto asmCode = compiler.lex(args)
                               .parse()
                               .optimize()
                               .codegen();

        // Note: no stupid redundant convert to string, parse,
        // convert back to internal representation.
        auto objectCode = assembler.assemble(asmCode);

        // Note: the linker has direct access to dependency info,
        // etc., carried over from asmCode -> objectCode.
        auto executable = linker.link(objectCode);

        auto output = File(outfile, "w");
        executable.generate(output);
    }
Note that the types asmCode, objectCode, and executable are arbitrarily complex, and may contain lazily-evaluated data structures, references to on-disk temporary storage (for large projects you can't hold everything in RAM), etc. Dependency information in asmCode is propagated to objectCode as necessary. The linker has full access to all the info the compiler has access to, and can perform inter-module optimization, etc., by accessing information available to the *compiler* front-end, not just some crippled object file format.
The root of the current nonsense is that perfectly fine data structures are arbitrarily required to be flattened into some kind of intermediate form, written to some file (or sent down some pipe), often with loss of information, then read from the other end, interpreted, and reconstituted into other data structures (with incomplete info), then processed. In many cases, information that didn't make it through the channel has to be reconstructed (often imperfectly) and then used. Most of these steps are redundant. If the compiler data structures were already directly available in the first place, none of this baroque dance would be necessary.
*Semantic info on the interdependency of symbols in a source file is destroyed right before the linker, and thus each .obj file is included as a whole or not at all. Thus all C run-times I've seen _sidestep_ this by writing each function in its own file(!). Even this alone should have been a clear indication.
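That all-or-nothing granularity is real, and modern GNU toolchains work around it rather than fix it: -ffunction-sections puts each function in its own section so the linker can garbage-collect functions individually, which is the same trick as one-function-per-file without the file explosion (a sketch; file and function names are invented):

```shell
cat > util.c <<'EOF'
#include <stdio.h>
void used(void)   { puts("used"); }
void unused(void) { puts("unused"); }
EOF
cat > prog.c <<'EOF'
void used(void);
int main(void) { used(); return 0; }
EOF
# Split each function into its own section, then let the linker
# discard unreferenced sections:
gcc -ffunction-sections -c util.c prog.c
gcc -Wl,--gc-sections prog.o util.o -o prog
./prog                     # prints "used"
nm prog | grep unused      # no output: the dead function was dropped
```

Without -ffunction-sections, pulling in used() would have dragged unused() along, exactly as the footnote complains.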
While simplicity (and correspondingly size in memory) of programs was king in the 70's, that's well past due. Nowadays I think it's all about getting the highest throughput and more powerful features.
[...]
Simplicity is good. Simplicity lets you modularize a very complex piece of software (a compiler that converts D source code into executables) into manageable chunks. Simplicity does not require shoe-horning modules into separate programs with separate address spaces and separate (and deficient) input/output formats.

The problem isn't with simplicity; the problem is with carrying over the archaic mapping of compilation stage -> separate program. I mean, imagine if std.regex were written so that regex compilation ran in a separate program with a separate address space, the regex matcher that executes the match ran in another separate program with another separate address space, and the two talked to each other via pipes, or worse, intermediate files.
I've mentioned a few times before a horrendous C++ project that I had to work with once, where to make a single function call to a particular subsystem, it had to go through 6 layers of abstraction, one of which was IPC through a local UNIX socket, *and* another of which involved fwrite()ing function parameters into a file and fread()ing said parameters from the file in another process, with the 6 layers repeating in reverse to propagate the return value of the function back to the caller.

In the new version of said project, that subsystem exposes a library API where to make a function call, you, um, just call the function (gee, what a concept). Needless to say, it didn't take a lot of effort to convince customers to upgrade, upon which we proceeded with great relish to delete every single source file having to do with that 6-layered monstrosity, and had a celebration afterwards.
From the design POV, though, the layout of the old version of the project utterly made sense. It was superbly (over)engineered, and if you made UML diagrams of it, they would be works of art fit for the British Museum. The implementation, however, was "somewhat" disappointing.

T
IMO, it's not even an issue of the separate address spaces. The core problem is the direct result of relying on *archaic file formats*. Simply serializing the intermediate data structure already solves the data-loss problems, and all that remains are aspects of efficiency, which are much less important given current compilation speeds. Separate address spaces can be useful if we add distributed and concurrent aspects into the mix.