Re: Compilation times and idiomatic D code

Enamex via Digitalmars-d Sat, 15 Jul 2017 04:16:19 -0700

On Friday, 14 July 2017 at 22:45:44 UTC, H. S. Teoh wrote:

Here's a further update to the saga of combating ridiculously large symbol sizes.
So yesterday I wrote a new module that also heavily uses UFCS chains. My initial draft of the module, once I linked it with the main program, particularly with a UFCS chain that has led to the 600MB executable sizes seen above, caused another explosion in symbol size that actually managed to reach 100MB in *one* symbol, triggering a DMD termination complaining about possible infinite template recursion. :-D Funnier still, temporarily simplifying part of the chain to bring symbol sizes down, I eventually got it below 100MB but ended up with linker segfaults and ELF errors because the huge symbol was too ridiculously huge.
Eventually, it drove me to refactor two Phobos functions that are used heavily in my code: std.range.chunks and std.algorithm.joiner, using the same "horcrux" technique (see Phobos PRs #5610 and #5611). This, together with some further refactoring in my own code, eventually brought things down to the 20MB range of executable sizes.
Then an idea occurred to me: the reason these symbol sizes got so large, was because the UFCS chain preserves *all* type information about every component type used to build the final type. So, by necessity, the type name has to somehow encode all of this information in an unambiguous way. Now, arguably, DMD's current mangling scheme is at fault because it contains too many repeating components, but even if you disregard that, the fact remains that if you have 50+ components in your overall UFCS chain, the symbol length cannot be less than 50*n where n is the average length of a single component's type name, plus some necessary overhead to account for the mangling scheme syntax. Let's say n is on average 20-25 characters, say round it up to 35 for mangling syntax, so you're still looking at upwards of 1700+ characters *minimum*. That, coupled with the current O(n^2) / O(n^3) mangling scheme, you easily reach megabytes of symbol length. We can compress the symbols all we want, but there's a limit as to how much compression will help. At the end of the day, you still have to represent those 50+ components *somehow*.
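To make the growth concrete, here is a minimal sketch (not taken from the original post) of a three-stage chain whose mangled type name gets longer at every stage, because each result type is a template instantiated on the type of the previous stage. The helper names stage1, stage2 and stage3 are hypothetical, and the exact lengths depend on the compiler version and its mangling scheme, so none are quoted here.

```d
import std.algorithm : filter, map;
import std.range : iota;
import std.stdio : writeln;

// Hypothetical helpers: each stage wraps the previous one, so the
// return type of stage3 names the types of stage2 and stage1 inside it.
auto stage1() { return iota(10).map!(x => x * 2); }
auto stage2() { return stage1().filter!(x => x % 3 == 0); }
auto stage3() { return stage2().map!(x => x + 1); }

// Length of the mangled type name at each stage, computed at compile time.
enum len1 = typeof(stage1()).mangleof.length;
enum len2 = typeof(stage2()).mangleof.length;
enum len3 = typeof(stage3()).mangleof.length;

// The back-of-envelope lower bound from the post: 50 components at
// roughly 35 characters each.
static assert(50 * 35 == 1750);

void main()
{
    // Actual values vary by compiler version; only the growth matters.
    writeln(len1);
    writeln(len2);
    writeln(len3);
}
```

Three stages already produce a nested type name; with 50+ stages the same nesting is what pushes the symbol past the 1700-character floor estimated above.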
But what if most of this type information is actually *unnecessary*? To use a range example, if all you care about at the end of the chain is that it's a forward range of ubyte, then why even bother with multi-MB symbols encoding type information that's never actually used? Maybe a little type-erasure would help, via hiding those 50+ component types behind an opaque runtime OO polymorphic interface. Phobos does have the facilities for this, in the form of the InputRange, ForwardRange, etc., interfaces in std.range.interfaces. In my case, however, part of the chain uses another generic type (a kind of generalization of 2D arrays). But either way, the idea is simple: at the end of the UFCS chain, wrap it in a class object that inherits from a generic interface that encodes only the primitives of the 2D array concept, and the element type. The class object itself, of course, still must retain knowledge of the full-fledged type, but the trick is that if we insert this type erasure in the middle of the chain, then later components don't have to encode the type names of earlier components anymore.
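For the plain range case, the type erasure described here might look like the following sketch, which uses inputRangeObject and the InputRange interface from std.range.interfaces. It is not the code from the post: the helper name erasedChain and the int element type are assumptions made for illustration, and the 2D-array interface mentioned above is not shown.

```d
import std.algorithm : equal, filter, map;
import std.range : iota;
import std.range.interfaces : InputRange, inputRangeObject;
import std.stdio : writeln;

// Hypothetical helper: the full static type of the chain is known only
// to the class object created by inputRangeObject. Callers see just
// InputRange!int.
InputRange!int erasedChain()
{
    return iota(10)
        .map!(x => x * 2)
        .filter!(x => x % 3 == 0)
        .inputRangeObject;
}

void main()
{
    // Later stages are instantiated on InputRange!int, so their type
    // names no longer embed the names of the earlier components.
    auto tail = erasedChain().map!(x => x + 1);
    writeln(typeof(tail).stringof);
    assert(tail.equal([1, 7, 13, 19]));
}
```

The cost of this design is that the wrapper is a class object and every front, popFront and empty call on it goes through the interface, so it trades some runtime overhead for shorter symbols.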
T


I have some stupid questions:

- What does everyone mean when they say 'symbol' here? I'm probably misunderstanding symbols gravely or it's something that DMD in particular handles in a strange way.

- What type information is being kept because of UFCS chains? Doesn't that mechanism simply apply overload resolution, then choose between the prefix and .method forms as appropriate, rewriting the terms? Then it's a problem of function invocation. I still don't get what's happening here. Does this tie to the Voldemort types problem? (=> are many of the functions in the chain returning custom types?) It doesn't make sense, especially if, from your indirection workaround, it looks like it would work the same but without the bloat problem if we unlinked the chain into many intermediate temporary parts. So how is this a type information issue?


Thanks!
