Re: Compilation strategy
12/18/2012 9:15 PM, Walter Bright wrote: On 12/18/2012 8:48 AM, Dmitry Olshansky wrote: After dropping debug info I can't make heads or tails of what's in the exe yet, but it _seems_ not to include all of the unused code. Gotta investigate on a smaller sample. Generate a linker .map file (-map to dmd). That'll tell you what's in it. It's rather enlightening, especially after running ddemangle over it. Still, it tells only half the story - what symbols are there (and a lot of them shouldn't have been) - now the most important part to figure out is _why_. Given that almost everything is templates and not instantiated (and thus, thank god, not present), still quite a few templates and certain normal functions made it in without ever being called. I'm sure they are not called, because I just imported the module. Adding trace prints to the functions in question shows nothing on screen. I tried running the linker with -xref and I see that the stuff I don't expect to land in the .exe looks either like this:

Symbol  Defined  Referenced
immutable(unicode_tables.SetEntry!(ushort).SetEntry)  unicode_tables.unicodeCased  unicode_tables

(meaning that it's not referenced anywhere yet is present - but I guess unreferenced global data is not stripped away), or, for functions:

dchar uni.toUpper(dchar)  uni  uni
const(@trusted dchar function(uint)) uni.Grapheme.opIndex  uni  uni
...

meaning that they are defined & referenced in the same module only (not the one with the empty main). Yet they're getting pulled in... I'm certain that at least toUpper is not called anywhere in the empty module (nor in module ctors, as I have none). Can you recommend any steps to see the web of symbols that eventually pulls them in? Peeking at the dependency chain (not file-grained but symbol-grained) in any form would be awesome. -- Dmitry Olshansky
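To sketch the kind of symbol-grained dependency peeking asked for here: if the linker's cross-reference output were parsed into an edge list, a simple graph search from the entry point would split the symbols into those actually reachable and those merely dragged in. The edge data and symbol names below are invented for illustration; this is not a parser for optlink's actual -xref format.

```python
from collections import deque

# Hypothetical reference edges, as one might scrape from a linker xref
# listing: symbol -> symbols it references. Names are made up.
refs = {
    "main":                 ["uni.toUpper"],
    "uni.toUpper":          ["uni.toLowerIndex"],
    "uni.toLowerIndex":     [],
    "uni.Grapheme.opIndex": ["uni.toUpper"],  # referenced only within uni
}

def reachable(roots, refs):
    """BFS over the symbol-reference graph from the given roots."""
    seen = set(roots)
    work = deque(roots)
    while work:
        for dep in refs.get(work.popleft(), ()):
            if dep not in seen:
                seen.add(dep)
                work.append(dep)
    return seen

live = reachable(["main"], refs)
dead = set(refs) - live   # candidates the linker could have stripped
```

On this toy data the search marks uni.Grapheme.opIndex as dead, which is exactly the "defined & referenced in the same module only" situation described above.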
Re: Compilation strategy
On Wednesday, 19 December 2012 at 17:17:34 UTC, Dmitry Olshansky wrote: 12/19/2012 1:33 AM, Walter Bright wrote: On 12/18/2012 11:58 AM, Dmitry Olshansky wrote: The same bytecode then could be used for external representation. Sigh, there is (again) no point to an external bytecode. BTW, in the end I think I was convinced that bytecode won't buy D much, especially considering the cost of maintaining a separate spec for it and making sure both are in sync. This argument is bogus. One of the goals of bytecode formats is to provide a common representation for many languages. That is one of the stated goals for MS' CIL and LLVM. So while it's definitely true that maintaining an *additional* format and spec adds considerable costs, that doesn't apply when *reusing an already existing* format, not to mention the benefits of interoperability with other supported languages and platforms. Consider calling Java libraries from JRuby, using C# code in F# projects, etc. Say I want to use both Haskell and D in the same project - how would I do it? Using LLVM I should be able to: both GHC and LDC are based on LLVM.
Re: Compilation strategy
12/19/2012 12:15 AM, Paulo Pinto wrote: On 18.12.2012 21:09, Dmitry Olshansky wrote: 12/19/2012 12:01 AM, Jacob Carlborg wrote: On 2012-12-18 17:48, Dmitry Olshansky wrote: I'm using objconv by Agner Fog as I haven't got dumpobj (guess I'll buy it if need be). dumpobj is included in the DMD release, at least on Mac OS X. And Linux has it. Guess Windows sucks ... dumpbin Only COFF I guess ;) -- Dmitry Olshansky
Re: Compilation strategy
12/19/2012 1:33 AM, Walter Bright wrote: On 12/18/2012 11:58 AM, Dmitry Olshansky wrote: The same bytecode then could be used for external representation. Sigh, there is (again) no point to an external bytecode. BTW, in the end I think I was convinced that bytecode won't buy D much, especially considering the cost of maintaining a separate spec for it and making sure both are in sync. -- Dmitry Olshansky
Re: Compilation strategy
On Tuesday, 18 December 2012 at 17:30:41 UTC, Walter Bright wrote: If I was doing it, and speed was paramount, I'd probably fix it to generate native code instead of bytecode and so execute code directly. Even simple JITs dramatically sped up the early Java VMs. Could you re-use the compiler recursively to first compile and run CTFE, followed by the rest? --rt
Re: Compilation strategy
On 12/18/2012 11:26 AM, H. S. Teoh wrote: Um, it does introduce major support costs for porting to different CPU targets. [...] Could you elaborate? Sure. You have to rewrite it when going from 32 to 64 bit code, or to ARM, or to any other processor. It's not the same as the regular code generator.
Re: Compilation strategy
On 12/18/2012 11:58 AM, Dmitry Olshansky wrote: The same bytecode then could be used for external representation. Sigh, there is (again) no point to an external bytecode.
Re: Compilation strategy
On 12/18/2012 11:23 AM, H. S. Teoh wrote: On Tue, Dec 18, 2012 at 09:55:57AM -0800, Walter Bright wrote: On 12/18/2012 9:42 AM, H. S. Teoh wrote: I was thinking more along the lines of things like fully automatic purity, safety, exception inference. For example, every function body eventually has to be processed by the compiler, so if a particular function is inferred to throw exception X, for example, then when its callers are compiled, this fact can be propagated to them. To do this for the whole program might be infeasible due to the sheer size of things, but if a module contains, for each function exposed by the API, a list of all thrown exceptions, then when the module is imported this information is available up-front and can be propagated further up the call chain. Same thing goes for purity and @safe. This may even allow us to make pure/@safe/nothrow fully automated so that you don't have to explicitly state them (except when you want the compiler to verify that what you wrote is actually pure, safe, etc.). The trouble with this is the separate compilation model. If the attributes are not in the function signature, then the function implementation can change without recompiling the user of that function. Changing the inferred attributes then will subtly break your build. And here's a reason for using an intermediate format (whether it's bytecode or just plain serialized AST or something else is irrelevant). Say we put the precompiled module in a zip file of some sort. If the function attributes change, so does the zip file. So if proper make dependencies are set up, this will automatically trigger the recompilation of whoever uses the module. Relying on a makefile being correct does not solve it. Inferred attributes only work when the implementation source is guaranteed to be available, such as with template functions. Having a binary format doesn't change this. Actually, this doesn't depend on the format being binary. 
You can save everything in plain text format and it will still work. In fact, there might be reasons to want a text format instead of binary, since then one could look at the compiler output to find out what the inferred attributes of a particular declaration are without needing to add compiler querying features. The "plain text format" that works is called D source code :-)
Re: Compilation strategy
On 12/18/2012 08:26 PM, H. S. Teoh wrote: On Tue, Dec 18, 2012 at 10:06:43AM -0800, Walter Bright wrote: On 12/18/2012 9:49 AM, H. S. Teoh wrote: Is it too late to change CTFE to work via native code? No, because doing so involves zero language changes. It is purely a quality-of-implementation issue. Well, that much is obvious; I was just wondering if the current implementation will require too much effort to make it work with native code CTFE. Besides the effort required to rework the existing code (and perhaps the cross-compiling issue, though I don't see it as a major issue), Um, it does introduce major support costs for porting to different CPU targets. [...] Could you elaborate? In my mind, there's not much additional cost to what's already involved in targeting a particular CPU in the first place. Since the CPU is already targeted, we generate native code for it and run that during CTFE. ... The generated native code would need to be different in order to support proper error reporting and dependency handling. (The generated code must be able to communicate with the analyzer/jit compiler.) The compiler does not have the full picture during analysis. It needs to figure out what information a CTFE call depends on. The only way that works in general is running it. E.g.:

string good() {
    mixin(foo(0, () => good())); // ok, delegate never called
}
string bad() {
    mixin(foo(1, () => bad())); // error, need body of bad to generate body of bad
}
string foo(bool b, string delegate() dg) {
    if (b) return dg();
    return q{return "return 0;";};
}
Re: Compilation strategy
On 18.12.2012 21:09, Dmitry Olshansky wrote: 12/19/2012 12:01 AM, Jacob Carlborg wrote: On 2012-12-18 17:48, Dmitry Olshansky wrote: I'm using objconv by Agner Fog as I haven't got dumpobj (guess I'll buy it if need be). dumpobj is included in the DMD release, at least on Mac OS X. And Linux has it. Guess Windows sucks ... dumpbin
Re: Compilation strategy
12/19/2012 12:01 AM, Jacob Carlborg wrote: On 2012-12-18 17:48, Dmitry Olshansky wrote: I'm using objconv by Agner Fog as I haven't got dumpobj (guess I'll buy it if need be). dumpobj is included in the DMD release, at least on Mac OS X. And Linux has it. Guess Windows sucks ... -- Dmitry Olshansky
Re: Compilation strategy
On 2012-12-18 17:48, Dmitry Olshansky wrote: I'm using objconv by Agner Fog as I haven't got dumpobj (guess I'll buy it if need be). dumpobj is included in the DMD release, at least on Mac OS X. -- /Jacob Carlborg
Re: Compilation strategy
12/18/2012 9:30 PM, Walter Bright wrote: On 12/18/2012 8:57 AM, Dmitry Olshansky wrote: But an adequate bytecode designed for interpreters (see e.g. Lua) is designed for faster execution. The way CTFE is done now* is a polymorphic call per AST node that does a lot of analysis that could be decided once and stored in ... *ehm* ... IR. Currently it's also somewhat mixed with semantic analysis (thus raising the complexity). The architectural failings of CTFE are primarily my fault, from taking an implementation shortcut and building it out of enhancing the constant folding code. They are not a demonstration of the inherent superiority of one scheme or another. Nor do CTFE's problems indicate that modules should be represented as bytecode externally. Agreed. It seemed to me that since CTFE implements an interpreter for D, it would be useful to define a flattened representation of the semantically analyzed AST that is tailored for execution. The same bytecode then could be used for external representation. There is however the problem of templates, which can only be analyzed on instantiation. So indeed we can't fully "precompile" the semantic step into bytecode, meaning that it won't be much beyond a flattened result of the parse step. On second thought it may not be that useful after all. Another point is that pointer-chasing data structures are not a recipe for fast repeated execution. To provide an analogy: executing a calculation recursively on the AST tree of an expression is bound to be slower than running the same calculation straight on a sanely encoded flat reverse-polish notation. A hit below the belt: also peek at your own DMDScript - why bother with a plain IR (_bytecode_!) for JavaScript if it could just as well be interpreted as-is on ASTs? Give me some credit for learning something over the last 12 years! I'm not at all convinced I'd use the same design if I were doing it now. 
OK ;) If I was doing it, and speed was paramount, I'd probably fix it to generate native code instead of bytecode and so execute code directly. Even simple JITs dramatically sped up the early Java VMs. Granted, a JIT is faster, but I'm personally more interested in portable interpreters. I've been digging around and gathering techniques, and so far it looks rather promising. Though I need more field testing... and computed gotos in D! Or, more specifically, a way to _force_ a tail call. -- Dmitry Olshansky
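The portable-interpreter technique alluded to here can be sketched with plain table dispatch; this Python toy (opcode names invented) stands in for what computed gotos or forced tail calls would buy in D, namely one cheap indirect jump per instruction instead of a big switch over node kinds:

```python
# Minimal table-dispatched stack VM. Each opcode maps to a handler;
# the main loop is a single indirect call per instruction.
def op_push(vm, arg): vm["stack"].append(arg)
def op_add(vm, _):
    s = vm["stack"]; b, a = s.pop(), s.pop(); s.append(a + b)
def op_mul(vm, _):
    s = vm["stack"]; b, a = s.pop(), s.pop(); s.append(a * b)

HANDLERS = {"push": op_push, "add": op_add, "mul": op_mul}

def run(program):
    vm = {"stack": []}
    for opcode, arg in program:      # the dispatch loop
        HANDLERS[opcode](vm, arg)    # one indirect call per instruction
    return vm["stack"][-1]

# 2 * 3 + 1
result = run([("push", 2), ("push", 3), ("mul", None),
              ("push", 1), ("add", None)])
```

With computed gotos (or guaranteed tail calls), each handler would jump straight to the next handler instead of returning to a central loop, which is where interpreters like Lua's get much of their speed.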
Re: Compilation strategy
On Tue, Dec 18, 2012 at 10:06:43AM -0800, Walter Bright wrote: > On 12/18/2012 9:49 AM, H. S. Teoh wrote: > >Is it too late to change CTFE to work via native code? > > No, because doing so involves zero language changes. It is purely a > quality-of-implementation issue. Well, that much is obvious; I was just wondering if the current implementation will require too much effort to make it work with native code CTFE. > >Besides the effort required to rework the existing code (and perhaps > >the cross-compiling issue, though I don't see it as a major issue), > > Um, it does introduce major support costs for porting to different CPU > targets. [...] Could you elaborate? In my mind, there's not much additional cost to what's already involved in targeting a particular CPU in the first place. Since the CPU is already targeted, we generate native code for it and run that during CTFE. Or are you referring to cross-compiling? T -- Tech-savvy: euphemism for nerdy.
Re: Compilation strategy
On Tue, Dec 18, 2012 at 09:55:57AM -0800, Walter Bright wrote: > On 12/18/2012 9:42 AM, H. S. Teoh wrote: > >I was thinking more along the lines of things like fully automatic > >purity, safety, exception inference. For example, every function body > >eventually has to be processed by the compiler, so if a particular > >function is inferred to throw exception X, for example, then when its > >callers are compiled, this fact can be propagated to them. To do this > >for the whole program might be infeasible due to the sheer size of > >things, but if a module contains, for each function exposed by the > >API, a list of all thrown exceptions, then when the module is > >imported this information is available up-front and can be propagated > >further up the call chain. Same thing goes with purity and @safe. > > > >This may even allow us to make pure/@safe/nothrow fully automated so > >that you don't have to explicitly state them (except when you want > >the compiler to verify that what you wrote is actually pure, safe, > >etc.). > > The trouble with this is the separate compilation model. If the > attributes are not in the function signature, then the function > implementation can change without recompiling the user of that > function. Changing the inferred attributes then will subtly break > your build. And here's a reason for using an intermediate format (whether it's bytecode or just plain serialized AST or something else, is irrelevant). Say we put the precompiled module in a zip file of some sort. If the function attributes change, so does the zip file. So if proper make dependencies are set up, this will automatically trigger the recompilation of whoever uses the module. > Inferred attributes only work when the implementation source is > guaranteed to be available, such as with template functions. > > Having a binary format doesn't change this. Actually, this doesn't depend on the format being binary. 
You can save everything in plain text format and it will still work. In fact, there might be reasons to want a text format instead of binary, since then one could look at the compiler output to find out what the inferred attributes of a particular declaration are without needing to add compiler querying features. T -- Why have vacation when you can work?? -- EC
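The propagation idea in this exchange can be sketched as a fixed-point pass over a call graph; the function names and thrown-exception sets below are invented for illustration, and a real compiler would record this per module on declarations rather than on a toy dict:

```python
# Toy inference of thrown-exception sets: each function's set is its own
# direct throws plus everything its callees can throw, iterated to a
# fixed point. This mimics propagating inferred attributes up the call chain.
calls = {
    "readFile": [],                 # leaf; throws directly
    "compute":  [],                 # leaf; throws nothing
    "parse":    ["readFile"],
    "main":     ["parse", "compute"],
}
direct_throws = {
    "readFile": {"FileException"},
}

def infer_throws(calls, direct_throws):
    throws = {f: set(direct_throws.get(f, ())) for f in calls}
    changed = True
    while changed:                  # iterate until nothing new propagates
        changed = False
        for f, callees in calls.items():
            for c in callees:
                new = throws[c] - throws[f]
                if new:
                    throws[f] |= new
                    changed = True
    return throws

throws = infer_throws(calls, direct_throws)
```

The separate-compilation objection above maps onto this sketch directly: if readFile's body changes its direct throws, every set computed from it changes too, so callers compiled against the old sets silently go stale.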
Re: Compilation strategy
On 12/18/2012 9:49 AM, H. S. Teoh wrote: Is it too late to change CTFE to work via native code? No, because doing so involves zero language changes. It is purely a quality-of-implementation issue. Besides the effort required to rework the existing code (and perhaps the cross-compiling issue, though I don't see it as a major issue), Um, it does introduce major support costs for porting to different CPU targets.
Re: Compilation strategy
On 12/18/2012 9:42 AM, H. S. Teoh wrote: I was thinking more along the lines of things like fully automatic purity, safety, exception inference. For example, every function body eventually has to be processed by the compiler, so if a particular function is inferred to throw exception X, for example, then when its callers are compiled, this fact can be propagated to them. To do this for the whole program might be infeasible due to the sheer size of things, but if a module contains, for each function exposed by the API, a list of all thrown exceptions, then when the module is imported this information is available up-front and can be propagated further up the call chain. Same thing goes with purity and @safe. This may even allow us to make pure/@safe/nothrow fully automated so that you don't have to explicitly state them (except when you want the compiler to verify that what you wrote is actually pure, safe, etc.). The trouble with this is the separate compilation model. If the attributes are not in the function signature, then the function implementation can change without recompiling the user of that function. Changing the inferred attributes then will subtly break your build. Inferred attributes only work when the implementation source is guaranteed to be available, such as with template functions. Having a binary format doesn't change this.
Re: Compilation strategy
On Tue, Dec 18, 2012 at 09:30:40AM -0800, Walter Bright wrote: > On 12/18/2012 8:57 AM, Dmitry Olshansky wrote: [...] > >Another point is that pointer chasing data-structures is not a recipe > >for fast repeated execution. > > > >To provide an analogy: executing calculation recursively on AST tree > >of expression is bound to be slower than running the same calculation > >straight on sanely encoded flat reverse-polish notation. > > > >A hit below belt: also peek at your own DMDScript - why bother with > >plain IR (_bytecode_!) for JavaScript if it could just fine be > >interpreted as is on AST-s? > > Give me some credit for learning something over the last 12 years! > I'm not at all convinced I'd use the same design if I were doing it > now. > > If I was doing it, and speed was paramount, I'd probably fix it to > generate native code instead of bytecode and so execute code > directly. Even simple JITs dramatically sped up the early Java > VMs. [...] Is it too late to change CTFE to work via native code? Besides the effort required to rework the existing code (and perhaps the cross-compiling issue, though I don't see it as a major issue), I see a lot of advantages to doing that. For one thing, it will solve the current complaints about CTFE speed and memory usage (a native code implementation would allow using a GC to keep memory footprint down, or perhaps just a sandbox that can be ditched after evaluation and its memory reclaimed). T -- Obviously, some things aren't very obvious.
Re: Compilation strategy
On Tue, Dec 18, 2012 at 08:12:51AM -0800, Walter Bright wrote: > On 12/18/2012 7:51 AM, H. S. Teoh wrote: > >An idea occurred to me while reading this. What if, when compiling a > >module, say, the compiler not only emits object code, but also > >information like which functions are implied to be strongly pure, > >weakly pure, @safe, etc., as well as some kind of symbol dependency > >information. Basically, any derived information that isn't > >immediately obvious from the code is saved. > > > >Then when importing the module, the compiler doesn't have to > >re-derive all of this information, but it is immediately available. > > > >One can also include information like whether a function actually > >throws an exception (regardless of whether it's marked nothrow), > >which exception(s) it throws, etc.. This may open up the possibility > >of doing some things with the language that are currently infeasible, > >regardless of the obfuscation issue. > > This is a binary import. It offers negligible advantages over .di > files. I was thinking more along the lines of things like fully automatic purity, safety, exception inference. For example, every function body eventually has to be processed by the compiler, so if a particular function is inferred to throw exception X, for example, then when its callers are compiled, this fact can be propagated to them. To do this for the whole program might be infeasible due to the sheer size of things, but if a module contains, for each function exposed by the API, a list of all thrown exceptions, then when the module is imported this information is available up-front and can be propagated further up the call chain. Same thing goes with purity and @safe. This may even allow us to make pure/@safe/nothrow fully automated so that you don't have to explicitly state them (except when you want the compiler to verify that what you wrote is actually pure, safe, etc.). T -- The slower you go, the further you'll get.
Re: Compilation strategy
On 12/18/2012 8:57 AM, Dmitry Olshansky wrote: But an adequate bytecode designed for interpreters (see e.g. Lua) is designed for faster execution. The way CTFE is done now* is a polymorphic call per AST node that does a lot of analysis that could be decided once and stored in ... *ehm* ... IR. Currently it's also somewhat mixed with semantic analysis (thus raising the complexity). The architectural failings of CTFE are primarily my fault, from taking an implementation shortcut and building it out of enhancing the constant folding code. They are not a demonstration of the inherent superiority of one scheme or another. Nor do CTFE's problems indicate that modules should be represented as bytecode externally. Another point is that pointer-chasing data structures are not a recipe for fast repeated execution. To provide an analogy: executing a calculation recursively on the AST tree of an expression is bound to be slower than running the same calculation straight on a sanely encoded flat reverse-polish notation. A hit below the belt: also peek at your own DMDScript - why bother with a plain IR (_bytecode_!) for JavaScript if it could just as well be interpreted as-is on ASTs? Give me some credit for learning something over the last 12 years! I'm not at all convinced I'd use the same design if I were doing it now. If I was doing it, and speed was paramount, I'd probably fix it to generate native code instead of bytecode and so execute code directly. Even simple JITs dramatically sped up the early Java VMs.
Re: Compilation strategy
On 12/18/2012 8:54 AM, Andrei Alexandrescu wrote: On 12/18/12 10:01 AM, Walter Bright wrote: On 12/18/2012 1:43 AM, Dmitry Olshansky wrote: Compared to doing computations on AST trees (and looking up every name in the symbol table?), creating fake nodes when the result is computed, etc.? CTFE does not look up every (or any) name in the symbol table. I don't see any advantage to interpreting bytecode over interpreting ASTs. In fact, all the Java bytecode is, is a serialized AST. My understanding is that Java bytecode is somewhat lowered, e.g. using a stack machine for arithmetic, jumps, etc., which makes it more amenable to interpretation than what an AST walker would do. The Java bytecode is indeed a stack machine, and a stack machine *is* a serialized AST. Also, bytecode is more directly streamable because you don't need any pointer fixups. A stack machine *is* a streamable representation of an AST. They are trivially convertible back and forth between each other, and I mean trivially.
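The "trivially convertible" claim can be made concrete with a toy sketch (node shapes invented for illustration, Python rather than D): flattening an expression tree to postfix order yields a stack-machine "bytecode", and replaying that code against a stack rebuilds the identical tree:

```python
# Nodes are ("num", n) leaves or (op, left, right) for a binary op.
def to_postfix(node):
    """Serialize an AST to flat postfix (stack-machine) code."""
    if node[0] == "num":
        return [node]
    op, l, r = node
    return to_postfix(l) + to_postfix(r) + [("op", op)]

def to_ast(code):
    """Rebuild the AST by replaying the postfix code against a stack."""
    stack = []
    for instr in code:
        if instr[0] == "num":
            stack.append(instr)
        else:                      # binary op: pop two operands
            r, l = stack.pop(), stack.pop()
            stack.append((instr[1], l, r))
    return stack.pop()

tree = ("+", ("num", 1), ("*", ("num", 2), ("num", 3)))
code = to_postfix(tree)
assert to_ast(code) == tree        # round trip: both directions, trivially
```

The `code` list is also what makes the streamability point: it is a flat sequence with no pointers to fix up, unlike the nested tuples of `tree`.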
Re: Compilation strategy
On 12/18/2012 8:48 AM, Dmitry Olshansky wrote: After dropping debug info I can't make heads or tails of what's in the exe yet, but it _seems_ not to include all of the unused code. Gotta investigate on a smaller sample. Generate a linker .map file (-map to dmd). That'll tell you what's in it.
Re: Compilation strategy
12/18/2012 7:01 PM, Walter Bright wrote: On 12/18/2012 1:43 AM, Dmitry Olshansky wrote: Compared to doing computations on AST trees (and looking up every name in the symbol table?), creating fake nodes when the result is computed, etc.? CTFE does not look up every (or any) name in the symbol table. I stand corrected - ditch "the looking up every name in symbol table" part. Honestly, I deduced that from your statement: >>>the type information and AST trees and symbol table. Note the symbol table. Looking inside, I cannot immediately grasp whether it ever uses it. I see that e.g. variables are tied to nodes that represent declarations, and values to expression nodes of the already processed AST. I don't see any advantage to interpreting bytecode over interpreting ASTs. In fact, all the Java bytecode is, is a serialized AST. We need no stinkin' Java ;) But an adequate bytecode designed for interpreters (see e.g. Lua) is designed for faster execution. The way CTFE is done now* is a polymorphic call per AST node that does a lot of analysis that could be decided once and stored in ... *ehm* ... IR. Currently it's also somewhat mixed with semantic analysis (thus raising the complexity). Another point is that pointer-chasing data structures are not a recipe for fast repeated execution. To provide an analogy: executing a calculation recursively on the AST tree of an expression is bound to be slower than running the same calculation straight on a sanely encoded flat reverse-polish notation. A hit below the belt: also peek at your own DMDScript - why bother with a plain IR (_bytecode_!) for JavaScript if it could just as well be interpreted as-is on ASTs? *I judge by a cursory look at the source and bits that Don sometimes shares about it. -- Dmitry Olshansky
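The reverse-polish analogy can be seen in a small sketch (Python rather than D, with a made-up node encoding): the recursive walk dispatches per node and chases pointers, while the flat form is a single loop over an array; both compute the same value:

```python
# Recursive AST walk: one function call and one dispatch per node.
def eval_tree(node):
    if isinstance(node, int):
        return node
    op, l, r = node
    a, b = eval_tree(l), eval_tree(r)
    return a + b if op == "+" else a * b

# Flat reverse-polish walk: one tight loop over a flat token list.
def eval_rpn(code):
    stack = []
    for tok in code:
        if isinstance(tok, int):
            stack.append(tok)
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if tok == "+" else a * b)
    return stack[0]

tree = ("+", 1, ("*", 2, 3))   # 1 + 2 * 3
rpn  = [1, 2, 3, "*", "+"]     # the same expression, flattened
assert eval_tree(tree) == eval_rpn(rpn)
```

The point of the analogy is the memory behavior, not Python timings: the flat list is contiguous and decision-free to traverse, whereas the tree walk follows a pointer per node.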
Re: Compilation strategy
On 12/18/12 10:01 AM, Walter Bright wrote: On 12/18/2012 1:43 AM, Dmitry Olshansky wrote: Compared to doing computations on AST trees (and looking up every name in the symbol table?), creating fake nodes when the result is computed, etc.? CTFE does not look up every (or any) name in the symbol table. I don't see any advantage to interpreting bytecode over interpreting ASTs. In fact, all the Java bytecode is, is a serialized AST. My understanding is that Java bytecode is somewhat lowered, e.g. using a stack machine for arithmetic, jumps, etc., which makes it more amenable to interpretation than what an AST walker would do. Also, bytecode is more directly streamable because you don't need any pointer fixups. In brief, I agree there's an isomorphism between bytecode and AST representation, but there are a few differences that may be important to certain applications. Andrei
Re: Compilation strategy
12/18/2012 6:51 PM, Walter Bright wrote: On 12/18/2012 1:33 AM, Dmitry Olshansky wrote: More than that - the end result is the same: to avoid carrying junk into an app you (or the compiler) still have to put each function in its own section. That's what COMDATs are. Okay.. Doing separate compilation I always (unless doing LTO or template-heavy code) see either the whole thing or nothing (D included). Most likely the compiler will do it for you only with a special switch. dmd emits COMDATs for all global functions. You can see this by running dumpobj on the output. Thanks for carrying on this Q. I'm using objconv by Agner Fog as I haven't got dumpobj (guess I'll buy it if need be). However, I see from the comments in the dumped asm that mark section boundaries that all functions are indeed in COMDAT sections. Still, linking these object files and disassembling the output, I see all of the functions are there intact. I added debug symbols to the build though - could that make optlink keep symbols? After dropping debug info I can't make heads or tails of what's in the exe yet, but it _seems_ not to include all of the unused code. Gotta investigate on a smaller sample. -- Dmitry Olshansky
Re: Compilation strategy
On 12/18/2012 7:51 AM, H. S. Teoh wrote: An idea occurred to me while reading this. What if, when compiling a module, say, the compiler not only emits object code, but also information like which functions are implied to be strongly pure, weakly pure, @safe, etc., as well as some kind of symbol dependency information. Basically, any derived information that isn't immediately obvious from the code is saved. Then when importing the module, the compiler doesn't have to re-derive all of this information, but it is immediately available. One can also include information like whether a function actually throws an exception (regardless of whether it's marked nothrow), which exception(s) it throws, etc.. This may open up the possibility of doing some things with the language that are currently infeasible, regardless of the obfuscation issue. This is a binary import. It offers negligible advantages over .di files.
Re: Compilation strategy
On Tue, Dec 18, 2012 at 07:01:28AM -0800, Walter Bright wrote: > On 12/18/2012 1:43 AM, Dmitry Olshansky wrote: > >Compared to doing computations on AST trees (and looking up every > >name in symbol table?), creating fake nodes when the result is > >computed etc? > > CTFE does not look up every (or any) name in the symbol table. I don't > see any advantage to interpreting bytecode over interpreting ASTs. In > fact, all the Java bytecode is, is a serialized AST. I've always thought that CTFE should run native code. Yes, I'm aware of the objections related to cross-compiling, etc., but honestly, how many people actually use a cross-compiler (or know what it is)? Interpreting CTFE to me should be a fallback, not the default mode of implementation. T -- Acid falls with the rain; with love comes the pain.
Re: Compilation strategy
On Tue, Dec 18, 2012 at 06:55:34AM -0800, Walter Bright wrote: > On 12/18/2012 3:43 AM, foobar wrote: > >Honest question - If D already has all the semantic info in COMDAT > >sections, > > It doesn't. COMDATs are object file sections. They do not contain > type info, for example. > > > * provide a byte-code solution to support the portability case. e.g > > Java byte-code or Google's pNaCL solution that relies on LLVM > > bit-code. > > There is no advantage to bytecodes. Putting them in a zip file does > not make them produce better results. [...] An idea occurred to me while reading this. What if, when compiling a module, say, the compiler not only emits object code, but also information like which functions are implied to be strongly pure, weakly pure, @safe, etc., as well as some kind of symbol dependency information. Basically, any derived information that isn't immediately obvious from the code is saved. Then when importing the module, the compiler doesn't have to re-derive all of this information, but it is immediately available. One can also include information like whether a function actually throws an exception (regardless of whether it's marked nothrow), which exception(s) it throws, etc.. This may open up the possibility of doing some things with the language that are currently infeasible, regardless of the obfuscation issue. T -- There are three kinds of people in the world: those who can count, and those who can't.
Re: Compilation strategy
On 12/18/2012 1:43 AM, Dmitry Olshansky wrote: Compared to doing computations on AST trees (and looking up every name in the symbol table?), creating fake nodes when the result is computed, etc.? CTFE does not look up every (or any) name in the symbol table. I don't see any advantage to interpreting bytecode over interpreting ASTs. In fact, all the Java bytecode is, is a serialized AST.
Re: Compilation strategy
On 12/18/2012 3:43 AM, foobar wrote: Honest question - If D already has all the semantic info in COMDAT sections, It doesn't. COMDATs are object file sections. They do not contain type info, for example. * provide a byte-code solution to support the portability case. e.g Java byte-code or Google's pNaCL solution that relies on LLVM bit-code. There is no advantage to bytecodes. Putting them in a zip file does not make them produce better results.
Re: Compilation strategy
On 12/18/2012 1:33 AM, Dmitry Olshansky wrote: More than that - the end result is the same: to avoid carrying junk into an app you (or the compiler) still have to put each function in its own section. That's what COMDATs are. Doing separate compilation I always (unless doing LTO or template-heavy code) see either the whole thing or nothing (D included). Most likely the compiler will do it for you only with a special switch. dmd emits COMDATs for all global functions. You can see this by running dumpobj on the output.
Re: Compilation strategy
On Tuesday, 18 December 2012 at 11:43:18 UTC, foobar wrote: On Monday, 17 December 2012 at 22:24:00 UTC, Walter Bright wrote: On 12/17/2012 2:08 PM, Dmitry Olshansky wrote: I really loved the way Turbo Pascal units were made. I wish D would go the same route. Object files would then be looked at as a minimal and stupid variation of a module where symbols are identified by mangling (not plain metadata as (would be) in a module) and no source for templates is emitted. +1 I'll bite. How is this superior to D's system? I have never used TP. *Semantic info on interdependency of symbols in a source file is destroyed right before the linker and thus each .obj file is included as a whole or not at all. Thus all C run-times I've seen _sidestep_ this by writing each function in its own file(!). Even this alone should have been a clear indication. This is done using COMDATs in C++ and D today. Honest question - If D already has all the semantic info in COMDAT sections, why do we still require additional auxiliary files? Surely, a single binary library (lib/so) should be enough to encapsulate a library without the need to re-parse the source files or additional header files? You yourself seem to agree that a single zip file is superior to what we currently have, and as an aside the entire Java community agrees with us - the Java Jar/War/etc formats are all renamed zip archives. Regarding the obfuscation and portability issues - the zip file can contain whatever we want. This means it should be possible to tailor the contents to support different use-cases: * provide fat libraries as on OSX - internally store multiple binaries for different architectures; those binary objects are very hard to decompile back to source code, thus answering the obfuscation need. * provide a byte-code solution to support the portability case, e.g. Java byte-code or Google's pNaCL solution that relies on LLVM bit-code.
Also, there are different work-flows that can be implemented - Java uses JIT to gain efficiency vs. .NET that supports install-time AOT compilation. It basically stores the native executable in a special cache. In Windows 8 RT, .NET binaries are actually compiled to native code when uploaded to the Windows App Store.
Re: Compilation strategy
On 12/18/12, foobar wrote: > Besides, the other compilers merge in the same front-end > code so they'll gain the same feature anyway. There's no gain in > separating it out to rdmd. Adding more front-end features adds more work for maintainers of compilers which are based on the DMD front-end, and not all compilers are based on the DMD front-end. Don't forget the huge gain of using D over C++ to implement the feature.
Re: Compilation strategy
On Tuesday, 18 December 2012 at 00:48:40 UTC, Walter Bright wrote: Wow, I think that's exactly what we could use! It serves multiple optional use cases all at once! Was there a technical reason for you not getting around to implementing it, or just a lack of time? There always seemed something more important to be doing, and Andrei thought it would be better to put such a capability in rdmd rather than dmd. This is inconsistent with D's design - providing useful features built-in (docs generator, testing, profiling, etc). Moreover, it breaks encapsulation. This means the compiler exposes an inferior format that will later be wrapped by a more capable packaging format, thus exposing the implementation details and adding an external dependency on that inferior format. Besides, the other compilers merge in the same front-end code so they'll gain the same feature anyway. There's no gain in separating it out to rdmd. The main question is whether you approve of the concept and are willing to put it on the to-do list. I'm sure that if you endorse this feature someone else will come in and implement it.
Re: Compilation strategy
On Monday, 17 December 2012 at 22:24:00 UTC, Walter Bright wrote: On 12/17/2012 2:08 PM, Dmitry Olshansky wrote: I really loved the way Turbo Pascal units were made. I wish D would go the same route. Object files would then be looked at as a minimal and stupid variation of a module where symbols are identified by mangling (not plain metadata as (would be) in a module) and no source for templates is emitted. +1 I'll bite. How is this superior to D's system? I have never used TP. *Semantic info on interdependency of symbols in a source file is destroyed right before the linker and thus each .obj file is included as a whole or not at all. Thus all C run-times I've seen _sidestep_ this by writing each function in its own file(!). Even this alone should have been a clear indication. This is done using COMDATs in C++ and D today. Honest question - If D already has all the semantic info in COMDAT sections, why do we still require additional auxiliary files? Surely, a single binary library (lib/so) should be enough to encapsulate a library without the need to re-parse the source files or additional header files? You yourself seem to agree that a single zip file is superior to what we currently have, and as an aside the entire Java community agrees with us - the Java Jar/War/etc formats are all renamed zip archives. Regarding the obfuscation and portability issues - the zip file can contain whatever we want. This means it should be possible to tailor the contents to support different use-cases: * provide fat libraries as on OSX - internally store multiple binaries for different architectures; those binary objects are very hard to decompile back to source code, thus answering the obfuscation need. * provide a byte-code solution to support the portability case, e.g. Java byte-code or Google's pNaCL solution that relies on LLVM bit-code. Also, there are different work-flows that can be implemented - Java uses JIT to gain efficiency vs. .NET that supports install-time AOT compilation.
It basically stores the native executable in a special cache.
Re: Compilation strategy
On Tuesday, 18 December 2012 at 00:15:04 UTC, H. S. Teoh wrote: On Tue, Dec 18, 2012 at 02:08:55AM +0400, Dmitry Olshansky wrote: [...] I suspect it's one of the prime examples where the UNIX philosophy of combining a bunch of simple (~ dumb) programs together in place of one more complex program was taken *far* beyond reasonable lengths. Having a pipeline: preprocessor -> compiler -> (still?) assembler -> linker where every program tries hard to know nothing about the previous ones (and to be as simple as it possibly can be) is bound to get inadequate results on many fronts: - efficiency & scalability - cross-border error reporting and detection (linker errors? errors for expanded macro magic?) - cross-file manipulations (e.g. optimization, see _how_ LTO is done in GCC) - multiple problems from a loss of information across the pipeline* The problem is not so much the structure preprocessor -> compiler -> assembler -> linker; the problem is that these logical stages have been arbitrarily assigned to individual processes residing in their own address space, communicating via files (or pipes, whatever it may be). The fact that they are separate processes is in itself not that big of a problem, but the fact that they reside in their own address space is a big problem, because you cannot pass any information down the chain except through rudimentary OS interfaces like files and pipes. Even that wouldn't have been so bad, if it weren't for the fact that user interface (in the form of text input / object file format) has also been conflated with program interface (the compiler has to produce the input to the assembler, in *text*, and the assembler has to produce object files that do not encode any direct dependency information because that's the standard file format the linker expects). Now consider if we keep the same stages, but each stage is not a separate program but a *library*.
The code then might look, in greatly simplified form, something like this:

    import libdmd.compiler;
    import libdmd.assembler;
    import libdmd.linker;

    void main(string[] args)
    {
        // typeof(asmCode) is some arbitrarily complex data
        // structure encoding assembly code, inter-module
        // dependencies, etc.
        auto asmCode = compiler.lex(args)
            .parse()
            .optimize()
            .codegen();

        // Note: no stupid redundant convert to string, parse,
        // convert back to internal representation.
        auto objectCode = assembler.assemble(asmCode);

        // Note: linker has direct access to dependency info,
        // etc., carried over from asmCode -> objectCode.
        auto executable = linker.link(objectCode);

        File output(outfile, "w");
        executable.generate(output);
    }

Note that the types asmCode, objectCode, executable, are arbitrarily complex, and may contain lazy-evaluated data structures, references to on-disk temporary storage (for large projects you can't hold everything in RAM), etc. Dependency information in asmCode is propagated to objectCode, as necessary. The linker has full access to all info the compiler has access to, and can perform inter-module optimization, etc., by accessing information available to the *compiler* front-end, not just some crippled object file format. The root of the current nonsense is that perfectly-fine data structures are arbitrarily required to be flattened into some kind of intermediate form, written to some file (or sent down some pipe), often with loss of information, then read from the other end, interpreted, and reconstituted into other data structures (with incomplete info), then processed. In many cases, information that didn't make it through the channel has to be reconstructed (often imperfectly), and then used. Most of these steps are redundant. If the compiler data structures were already directly available in the first place, none of this baroque dance is necessary.
*Semantic info on interdependency of symbols in a source file is destroyed right before the linker and thus each .obj file is included as a whole or not at all. Thus all C run-times I've seen _sidestep_ this by writing each function in its own file(!). Even this alone should have been a clear indication. While simplicity (and correspondingly size in memory) of programs was king in the '70s, that era is well past. Nowadays I think it's all about getting the highest throughput and more powerful features. [...] Simplicity is good. Simplicity lets you modularize a very complex piece of software (a compiler that converts D source code into executables) into manageable chunks. Simplicity does not require shoe-horning modules into separate programs with separate address spaces with separate (and deficient) input/output formats. The problem is
Re: Compilation strategy
12/18/2012 4:42 AM, Walter Bright writes: On 12/17/2012 3:03 PM, deadalnix wrote: I know that. I'm not arguing against that. I'm arguing against the fact that this is a blocker. This is a blocker in very few use cases, in fact. I just look at the whole picture here. People needing that are the exception, not the rule. I'm not sure what you mean. A blocker for what? And what prevents us from using a bytecode that loses information? I'd turn that around and ask why have a bytecode? As long as it is CTFEable, most people will be happy. CTFE needs the type information and AST trees and symbol table. Everything needed for decompilation. The fact that CTFE has to crawl AST trees is AFAIK a mere happenstance. It helps nothing except the way CTFE was hacked into the current compiler structure. There should be a far more suitable IR (if you don't like the bytecode term) if we are to run CTFE at speeds even marginally comparable to run-time. I know that bytecode has been around since 1995 in its current incarnation, and there's an ingrained assumption that since there's such an extensive ecosystem around it, there must be some advantage to it. I don't care for ecosystems. And there is none involved in the argument. But there isn't. Compared to doing computations on AST trees (and looking up every name in the symbol table?), creating fake nodes when the result is computed, etc.? I'm out of words. -- Dmitry Olshansky
Re: Compilation strategy
12/18/2012 2:23 AM, Walter Bright writes: On 12/17/2012 2:08 PM, Dmitry Olshansky wrote: I really loved the way Turbo Pascal units were made. I wish D would go the same route. Object files would then be looked at as a minimal and stupid variation of a module where symbols are identified by mangling (not plain metadata as (would be) in a module) and no source for templates is emitted. +1 I'll bite. How is this superior to D's system? I have never used TP. One superiority is having a compiled module with its public interface (a-la .di but in some binary format) in one file. Along with the public interface it retains dependency information. Basically, things that describe one entity should not be separated. I can say that the advantage of "grab this single file and you are good to go" should not be underestimated. Thus there is no mess with header files out of date and/or object files that fail to link because of that. Now, back then there were no templates nor CTFE, so the module structure was simple. There were no packages either (they landed in Delphi). I'd expect D to have a format built around modules and packages of these. Then pre-compiled libraries are commonly distributed as a package. The upside of having our own special format is being able to tailor it for our needs, e.g. store type info & metadata plainly (no mangle-demangle), have separately compiled (and checked) pure functions, better cross-symbol dependency info, etc. To link with C we could still compile all of the D modules into a huge object file (split into a monstrous number of sections). *Semantic info on interdependency of symbols in a source file is destroyed right before the linker and thus each .obj file is included as a whole or not at all. Thus all C run-times I've seen _sidestep_ this by writing each function in its own file(!). Even this alone should have been a clear indication. This is done using COMDATs in C++ and D today. Well, that's terse.
Either way it looks like a workaround for templates that during separate compilation dump identical code into the objs, to auto-merge these. More than that - the end result is the same: to avoid carrying junk into an app you (or the compiler) still have to put each function in its own section. Doing separate compilation I always (unless doing LTO or template-heavy code) see either whole or nothing (D included). Most likely the compiler will do it for you only with a special switch. This raises another question - why not eliminate junk by default? P.S. Looking at M$: http://msdn.microsoft.com/en-us/library/xsa71f43.aspx it needs two switches - one for the linker, one for the compiler. Hilarious. -- Dmitry Olshansky
Re: Compilation strategy
On Tuesday, 18 December 2012 at 07:48:01 UTC, Jacob Carlborg wrote: On 2012-12-17 23:12, Walter Bright wrote: I have toyed with the idea many times, however, of having dmd support zip files. Zip files can contain an arbitrary file hierarchy, with individual files in compressed, encrypted, or plaintext at the selection of the zip builder. An entire project, or library, or collection of source modules can be distributed as a zip file, and could be compiled with nothing more than: I think that a package manager should handle this. Example: https://github.com/jacob-carlborg/orbit/wiki/Orbit-Package-Manager-for-D Yes I can change the orbfiles to be written in D. Hehe. :) I even checked out your code with that idea in mind, but other things keep having higher priority. -- Paulo
Re: Compilation strategy
On 12/17/2012 11:40 PM, Jacob Carlborg wrote: On 2012-12-17 00:09, Walter Bright wrote: Figure out the cases where it happens and fix those cases. How is it supposed to work? Could there be some issue with the dependency tracker that should otherwise have indicated that more modules should have been recompiled? It should only generate function bodies that are needed, not all of them.
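Presumably "only the bodies that are needed" refers to template instantiation being demand-driven — a hypothetical illustration:

```d
// Only instantiations that are actually used should get code generated.
T square(T)(T x) { return x * x; }

void main()
{
    auto a = square(3); // emits code for square!int only;
    // square!double, square!float, etc. never appear in the object
    // file, because nothing instantiates them.
}
```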
Re: Compilation strategy
On 2012-12-17 23:12, Walter Bright wrote: I have toyed with the idea many times, however, of having dmd support zip files. Zip files can contain an arbitrary file hierarchy, with individual files in compressed, encrypted, or plaintext at the selection of the zip builder. An entire project, or library, or collection of source modules can be distributed as a zip file, and could be compiled with nothing more than: I think that a package manager should handle this. Example: https://github.com/jacob-carlborg/orbit/wiki/Orbit-Package-Manager-for-D Yes I can change the orbfiles to be written in D. -- /Jacob Carlborg
Re: Compilation strategy
On 2012-12-17 00:09, Walter Bright wrote: Figure out the cases where it happens and fix those cases. How is it supposed to work? Could there be some issue with the dependency tracker that should otherwise have indicated that more modules should have been recompiled? -- /Jacob Carlborg
Re: Compilation strategy
On 2012-12-18 01:13, H. S. Teoh wrote: The problem is not so much the structure preprocessor -> compiler -> assembler -> linker; the problem is that these logical stages have been arbitrarily assigned to individual processes residing in their own address space, communicating via files (or pipes, whatever it may be). The fact that they are separate processes is in itself not that big of a problem, but the fact that they reside in their own address space is a big problem, because you cannot pass any information down the chain except through rudimentary OS interfaces like files and pipes. Even that wouldn't have been so bad, if it weren't for the fact that user interface (in the form of text input / object file format) has also been conflated with program interface (the compiler has to produce the input to the assembler, in *text*, and the assembler has to produce object files that do not encode any direct dependency information because that's the standard file format the linker expects). Now consider if we keep the same stages, but each stage is not a separate program but a *library*. The code then might look, in greatly simplified form, something like this:

    import libdmd.compiler;
    import libdmd.assembler;
    import libdmd.linker;

    void main(string[] args)
    {
        // typeof(asmCode) is some arbitrarily complex data
        // structure encoding assembly code, inter-module
        // dependencies, etc.
        auto asmCode = compiler.lex(args)
            .parse()
            .optimize()
            .codegen();

        // Note: no stupid redundant convert to string, parse,
        // convert back to internal representation.
        auto objectCode = assembler.assemble(asmCode);

        // Note: linker has direct access to dependency info,
        // etc., carried over from asmCode -> objectCode.
        auto executable = linker.link(objectCode);

        File output(outfile, "w");
        executable.generate(output);
    }

Note that the types asmCode, objectCode, executable, are arbitrarily complex, and may contain lazy-evaluated data structures, references to on-disk temporary storage (for large projects you can't hold everything in RAM), etc. Dependency information in asmCode is propagated to objectCode, as necessary. The linker has full access to all info the compiler has access to, and can perform inter-module optimization, etc., by accessing information available to the *compiler* front-end, not just some crippled object file format. The root of the current nonsense is that perfectly-fine data structures are arbitrarily required to be flattened into some kind of intermediate form, written to some file (or sent down some pipe), often with loss of information, then read from the other end, interpreted, and reconstituted into other data structures (with incomplete info), then processed. In many cases, information that didn't make it through the channel has to be reconstructed (often imperfectly), and then used. Most of these steps are redundant. If the compiler data structures were already directly available in the first place, none of this baroque dance is necessary. I couldn't agree more. -- /Jacob Carlborg
Re: Compilation strategy
On 12/17/12 6:11 PM, Rob T wrote: On Monday, 17 December 2012 at 22:12:01 UTC, Walter Bright wrote:

    dmd xx foo.zip

is equivalent to:

    unzip foo
    dmd xx a.d b/c.d d.obj

P.S. I've also wanted to use .zip files as the .lib file format (!), as the various .lib formats have nothing over .zip files. Wow, I think that's exactly what we could use! It serves multiple optional use cases all at once! Was there a technical reason for you not getting around to implementing it, or just a lack of time? The latter. I wanted to do it in rdmd for ages. Andrei
Re: Compilation strategy
On Tuesday, 18 December 2012 at 02:48:05 UTC, Walter Bright wrote: Using standard zip tools is a big plus. Yes, but why limit yourself in this way? The easy answer is to look at the problems stemming from dmd using an object file format on Win32 that nobody else uses. I definitely didn't say dmd should use a format that no one else uses. In fact I actually said the opposite. Dmd should not be limiting the choices people want to make for themselves. In other words, it should allow people to use whatever formats they wish to use, and not a format imposed on them, such as was pointed out by yourself with the Win32 object file format. When I look at what dmd is, it's monolithic and very unfriendly to extensibility and re-use, so I was suggesting that you look at ways to free dmd from the current set of restraints that are holding back its use and adoption. My suggestion was that the compiler should be restructured into a re-usable, modularized system with built-in user extensibility features. Doing that would be a massive improvement and a very big step forward. --rt
Re: Compilation strategy
On Mon, 17 Dec 2012 00:19:51 -0800, deadalnix wrote: On Monday, 17 December 2012 at 08:02:12 UTC, Adam Wilson wrote: With respect to those who hold one ideology above others, trying to impose those ideals on another is a great way to ensure animosity. What a business does with their code is entirely up to them, and I would guess that even Richard Stallman himself would take issue with trying to impose an ideology on another person. What does that mean for D practically? Using a close-to-home example, imagine if Remedy decided that shipping their ENTIRE codebase in .DI files with the product would cause them to give away some new rendering trick that they came up with that nobody else had. And they decided that this was unacceptable. What would they most likely do? Rewrite the project in C++ and tell the D community to kindly pound sand. A license agreement is not enough to stop a thief. And once the new trick makes it into the wild, as long as a competitor can honestly say they had no idea how they got it (and they probably really don't, as they saw it on a legitimate game development website) the hands of the legal system are tied. But that's what I say! I can't stop myself laughing at people who think any business can be based on Java, PHP or C#. That is a mere dream! Such technology will simply never get used in companies, because bytecode can be decoded! I use C# every day at my job. We have expensive obfuscators to protect the bytecode. Even then it isn't perfect. But it's good enough. With the metadata model of .NET this isn't a problem for public APIs; just tell the obfuscator to ignore everything marked 'public'. However, with DIs being a copy of the plaintext source and NOT bytecode, you run the risk of changing the meaning of the program in unintended ways.
You see, in CIL (.NET bytecode) there are no autos (vars) or templates (generics); the C# compiler does the work of figuring out what the auto type really should be, or what the templates really are, BEFORE it writes out the IL, so later when the obfuscator does its job, there are no templates or autos for it to deal with. In D, we don't have this option: you either have plaintext, or you have binary code; there is no intermediate step like CIL. Hence we can't use the obfuscation approach. -- Adam Wilson IRC: LightBender Project Coordinator The Horizon Project http://www.thehorizonproject.org/
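The contrast Adam describes shows up in D's .di interface files: a non-template function can be reduced to a declaration, but a template must keep its full source body, because each importing module instantiates it anew (module and function names below are made up for illustration):

```d
// mylib.di — hypothetical interface file.
module mylib;

int frob(int x); // non-template: the body can live in the binary library

T clamp(T)(T v, T lo, T hi) // template: full source must ship, since
{                           // instantiation happens at the import site
    return v < lo ? lo : v > hi ? hi : v;
}
```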
Re: Compilation strategy
On Tuesday, 18 December 2012 at 02:48:05 UTC, Walter Bright wrote: Using standard zip tools is a big plus. Yes, but why limit yourself in this way? The easy answer is to look at the problems stemming from dmd using an object file format on Win32 that nobody else uses. I didn't say dmd should use a format that no one uses. What I did say is that you should not be limiting the choices people want to make for themselves. The current approach is a self-limiting one that is unable to make effective use of the resourcefulness of the D community. DMD may be open source, but it's a monolithic system that is very unfriendly to extensibility and re-use. Look at this thread of discussion; it is caused by the inability to make effective use of the compiler in ways that should be of absolutely no concern to you. You should be looking at ways of enabling users to be as free and creative as they can be, and that means you must let them make their own mistakes, not the opposite. --rt
Re: Compilation strategy
On 12/17/2012 6:40 PM, Rob T wrote: On Tuesday, 18 December 2012 at 02:17:25 UTC, Walter Bright wrote: On 12/17/2012 6:13 PM, Rob T wrote: Your suggestion concerning the use of zip files is a good idea, although you mention the encryption algo is very weak, but is there any reason to use a weak encryption algo, and is there even a reason to bother maintaining compatibility with the common zip format? Using standard zip tools is a big plus. Yes, but why limit yourself in this way? The easy answer is to look at the problems stemming from dmd using an object file format on Win32 that nobody else uses.
Re: Compilation strategy
On Tuesday, 18 December 2012 at 02:17:25 UTC, Walter Bright wrote: On 12/17/2012 6:13 PM, Rob T wrote: Your suggestion concerning the use of zip files is a good idea, although you mention the encryption algo is very weak, but is there any reason to use a weak encryption algo, and is there even a reason to bother maintaining compatibility with the common zip format? Using standard zip tools is a big plus. Yes, but why limit yourself in this way? I suppose you could provide a choice between different formats, but that's the wrong approach. The compiler should instead be restructured to allow D users to supply their own functionality in the form of user defined plugins, that way you won't have to bother second guessing what people need or don't need, or provide generic one size fits all solutions that no one likes, and you'll gain an army of coders who'll take D into very surprising directions that no one could possibly predict. Another nice fix would be to separate the CTFE interpreter out of the compiler as a loadable library so it can be used outside of the compiler for embedded D scripting, and possibly even for JIT applications. I expect there are a few more significant improvements that could be made simply by making the compiler less monolithic and more modularized. Easier said than done, but it should be done at some point because the advantages are very significant. --rt
Re: Compilation strategy
3) Performance can be improved to (near) native speeds with a JIT compiler. But then you might as well go native to begin with. Why wait till runtime to do compilation, when it can be done beforehand? The point though is that with a JIT, you can transmit source code (or byte code, which is smaller in size) over a wire and have it execute natively on a client machine. You cannot do that with native machine code because the client machine is always an unknown target. But then again, even if we never do this, it makes no difference to *me* -- the current situation is good enough for *me*. The question is whether or not we want D to be better received by enterprises. Exactly, the *we* part of all this doesn't matter in the slightest; it's what the end user wants that matters. If many potential D users want to hide their code (even if it's trivially hidden), but D won't let them, then they won't use D. It's a very simple equation, but holding on to idealisms will often get in the way of good sense. We already had one corporate user complain in here about the issue, and for everyone who complains there are dozens more who will say nothing at all and just walk away. --rt
Re: Compilation strategy
On 12/17/2012 6:13 PM, Rob T wrote: Your suggestion concerning the use of zip files is a good idea, although you mention the encryption algo is very weak, but is there any reason to use a weak encryption algo, and is there even a reason to bother maintaining compatibility with the common zip format? Using standard zip tools is a big plus.
Re: Compilation strategy
On Tuesday, 18 December 2012 at 01:52:21 UTC, Walter Bright wrote: If we implement a way of "hiding" implementation details that *allows* CTFE and templates (and thus one up the C++ situation), this will create a stronger incentive for D adoption. It doesn't matter if it's not hard to "unhide" the implementation; Yes, it does, because we would be lying if we were pretending this was an effective solution. If you can hide the implementation details for other reasons, then no such claim need be made at all; in fact you can explicitly warn people that the code is not really hidden should they think otherwise. Your suggestion concerning the use of zip files is a good idea, although you mention the encryption algo is very weak, but is there any reason to use a weak encryption algo, and is there even a reason to bother maintaining compatibility with the common zip format? I would expect that there are other compression formats that could be used. --rt
Re: Compilation strategy
On 12/17/2012 5:40 PM, Simen Kjaeraas wrote: .zip already has encryption, Just for the record, zip file "encryption" is trivially broken, and there are free downloadable tools to do that. About all it will do is keep your kid sister from reading your diary.
Re: Compilation strategy
On 12/17/2012 5:28 PM, H. S. Teoh wrote: Using PIMPL only helps if you're trying to hide implementation details of a struct or class. Anything that requires CTFE is out of the question. Templates are out of the question (this was also true with C++). This reduces the incentive to adopt D, since they might as well just stick with C++. We lose. I've never seen any closed-source companies reticent about using C++ because of obfuscation issues, which are the same as for D, so I do not see this as a problem. If we implement a way of "hiding" implementation details that *allows* CTFE and templates (and thus one up the C++ situation), this will create a stronger incentive for D adoption. It doesn't matter if it's not hard to "unhide" the implementation; Yes, it does, because we would be lying if we were pretending this was an effective solution. we don't lose anything (having no way to hide implementation is what we already have), plus it increases our chances of adoption -- esp. by enterprises, who are generally the kind of people who even care about this issue in the first place, and who are the people we *want* to attract. Sounds like a win to me. We'd lose credibility with them, as people will laugh at us over this. But then again, even if we never do this, it makes no difference to *me* -- the current situation is good enough for *me*. The question is whether or not we want D to be better received by enterprises. As I said, C++ is well received by enterprises. This is not an issue.
Re: Compilation strategy
On Tuesday, 18 December 2012 at 01:30:22 UTC, H. S. Teoh wrote: If we implement a way of "hiding" implementation details that *allows* CTFE and templates (and thus one up the C++ situation), this will create a stronger incentive for D adoption. It doesn't matter if it's not hard to "unhide" the implementation; we don't lose anything (having no way to hide implementation is what we already have), plus it increases our chances of adoption -- esp. by enterprises, who are generally the kind of people who even care about this issue in the first place, and who are the people we *want* to attract. Sounds like a win to me. I agree with that; involving the big guys is necessary to make the language live. If we don't, then D would become just another fan-loved language. It's really bad... well, that's the point.
Re: Compilation strategy
On 2012-12-18, 02:28, H. S. Teoh wrote: If we implement a way of "hiding" implementation details that *allows* CTFE and templates (and thus one up the C++ situation), this will create a stronger incentive for D adoption. It doesn't matter if it's not hard to "unhide" the implementation; we don't lose anything (having no way to hide implementation is what we already have), plus it increases our chances of adoption -- esp. by enterprises, who are generally the kind of people who even care about this issue in the first place, and who are the people we *want* to attract. Sounds like a win to me. .zip already has encryption, and unpacking those files and feeding them to the compiler should be a rather simple tool. Sure, if someone makes it, it could probably become part of the distribution. But making it a part of the compiler seems more than excessive. -- Simen
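Simen's "rather simple tool" can be sketched in a few lines. This is a hypothetical illustration in Python, not an existing utility; the archive name, the password, and the idea of handing the result to dmd are all assumptions (note that zipfile's setpwd only handles the legacy ZipCrypto scheme, which is weak, fitting the "illusion of security" theme discussed above):

```python
import tempfile
import zipfile
from pathlib import Path

def unpack_d_sources(archive, password=None):
    """Extract the .d files from `archive` into a temp directory and
    return their paths, ready to be handed to the compiler."""
    workdir = Path(tempfile.mkdtemp(prefix="dsrc-"))
    with zipfile.ZipFile(archive) as zf:
        if password is not None:
            zf.setpwd(password)  # legacy ZipCrypto passwords only
        members = [n for n in zf.namelist() if n.endswith(".d")]
        zf.extractall(workdir, members=members)
    return [str(workdir / m) for m in members]

# A wrapper tool would then run something like:
#   subprocess.run(["dmd"] + unpack_d_sources("lib.zip", b"secret"))
```

Shipping such a wrapper alongside the compiler (or as part of rdmd) would keep the compiler itself out of the encryption business.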
Re: Compilation strategy
On Mon, Dec 17, 2012 at 04:42:13PM -0800, Walter Bright wrote: > On 12/17/2012 3:03 PM, deadalnix wrote: [...] > >And what prevents us from using a bytecode that loses information? > > I'd turn that around and ask why have a bytecode? > > > >As long as it is CTFEable, most people will be happy. > > CTFE needs the type information and AST trees and symbol table. > Everything needed for decompilation. > > I know that bytecode has been around since 1995 in its current > incarnation, and there's an ingrained assumption that since there's > such an extensive ecosystem around it, that there is some advantage > to it. > > But there isn't. Now this, I have to agree with. The only advantage to bytecode is that if you have two interpreters on two different platforms, then bytecode on one can run verbatim on the other. But: 1) Bytecode is slower than native code, and always will be. 2) Unless, of course, you're running a machine that runs the bytecode directly. But that just means your code is native to that machine, and the interpreters on other machines are emulators. So you're already using native code anyway. And since you're already at it, might as well just use native code on the other machines, too. 3) Performance can be improved to (near) native speeds with a JIT compiler. But then you might as well go native to begin with. Why wait till runtime to do compilation, when it can be done beforehand? 4) Bytecode cannot be (easily) linked with native libraries. Various wrappers and other workarounds are necessary. The bytecode/native boundary is often inefficient, because generally there's need of translation between bytecode interpreter data types and native data types. 5) There are other issues, but I can't be bothered to think of them right now. But anyway, this is getting a bit off-topic. The original issue was separate compilation, and .di files.
Just for the record, I'd like to state that I am *not* convinced about the need to obfuscate library code (either by using .di or by other means), primarily because it's futile, but also because I believe in open source code. However, I know a LOT of employers and enterprises are NOT comfortable with the idea, and would not so much as consider a particular language/toolchain if they can't at least have the illusion of security. You may say it's silly, and I'd agree, but that does nothing to help adoption. Using PIMPL only helps if you're trying to hide implementation details of a struct or class. Anything that requires CTFE is out of the question. Templates are out of the question (this was also true with C++). This reduces the incentive to adopt D, since they might as well just stick with C++. We lose. If we implement a way of "hiding" implementation details that *allows* CTFE and templates (and thus one up the C++ situation), this will create a stronger incentive for D adoption. It doesn't matter if it's not hard to "unhide" the implementation; we don't lose anything (having no way to hide implementation is what we already have), plus it increases our chances of adoption -- esp. by enterprises, who are generally the kind of people who even care about this issue in the first place, and who are the people we *want* to attract. Sounds like a win to me. But then again, even if we never do this, it makes no difference to *me* -- the current situation is good enough for *me*. The question is whether or not we want D to be better received by enterprises. T -- I am a consultant. My job is to make your job redundant. -- Mr Tom
Re: Compilation strategy
On 12/17/2012 4:47 PM, deadalnix wrote: On Tuesday, 18 December 2012 at 00:42:13 UTC, Walter Bright wrote: On 12/17/2012 3:03 PM, deadalnix wrote: I know that. I'm not arguing against that. I'm arguing against the fact that this is a blocker. This is a blocker in very few use cases, in fact. I just look at the whole picture here. People needing that are the exception, not the rule. I'm not sure what you mean. A blocker for what? And what prevents us from using a bytecode that loses information? I'd turn that around and ask why have a bytecode? Because it is CTFEable efficiently, without requiring either to recompile the source code or even distribute the source code. I've addressed that issue several times now. I know I'm arguing against scores of billions of dollars invested in JVM bytecode, but the emperor isn't wearing clothes. As long as it is CTFEable, most people will be happy. CTFE needs the type information and AST trees and symbol table. Everything needed for decompilation. You do not need more information than what is in a .di file. Yeah, which is source code. I think you just conceded :-) Java and C# put more info in than that because of runtime reflection (and still, there are tools to strip most of it; no type info, granted, but everything else), something we don't need. There's nothing to be stripped from .class files without rendering them unusable.
Re: Compilation strategy
On 12/17/2012 3:27 PM, Denis Koroskin wrote: On Mon, 17 Dec 2012 13:47:36 -0800, Walter Bright wrote: I've often thought Java bytecode was a complete joke. It doesn't deliver any of its promises. You could tokenize Java source code, run the result through an lzw compressor, and get the equivalent functionality in every way. Not true at all. Bytecode is semi-optimized, I'm not unaware of that, recall I wrote a Java compiler. The "semi-optimized" is generous. The bytecode simply doesn't allow for any significant optimization. easier to manipulate (obfuscate, instrument, etc.), The obfuscators for bytecode are ineffective. It's probably marginally easier to instrument, but adjusting a Java compiler to emit instrumented code is just as easy. JVM/CLR bytecode is shared by many languages (Java, Scala / C#, F#) so you don't need a separate parser for each language, That's true. But since there's a 1:1 correspondence between bytecode and Java, you can just as easily emit Java from your backend. and there is hardware that supports running JVM bytecode on the metal. There's a huge problem with that approach. Remember I said that bytecode can't be more than trivially optimized? In hardware, there's no optimization, so it's going to be doomed to slow execution. Even a trivial JIT will beat it, and if you go beyond basic code generation to using a real optimizer, that will beat the pants off of any hardware bytecode machine. Which is why such machines have not caught on. It's the wrong place in the compilation process to put the hardware. Try doing the same with lzw'd source code. Modern CPU design is heavily influenced by the kind of instructions compilers like to emit. So, in a sense, this is already the case and has been for decades.
(Note the disuse of some instructions in the 8086 that compilers never emit, and their consequent relegation to having as little silicon as possible reserved for them, and the consequent caveats to "never use those instructions, they are terribly slow".)
Re: Compilation strategy
On 12/17/2012 3:11 PM, Rob T wrote: I suspect most file transport protocols already compress the data, so compressing it ourselves probably accomplishes nothing. There are also compressed filesystems, so storing files in a compressed manner likely accomplishes little. Yes, however my understanding is that HTTP-based file transfers are often not compressed despite the protocol specifically supporting the feature. The problem is not with the protocol; it's that some clients and servers simply do not implement the feature or are misconfigured. HTTP, as you know, is very widely used for transferring files. I don't think fixing misconfigured HTTP servers is something D should address. Another thing to consider is using byte code for interpretation; that way D could be used directly in game engines in place of Lua or other scripting methods, or even as a replacement for JavaScript. Of course you know best if this is practical for a language like D, but maybe a subset of D is practical, I don't know. Again, there is zero advantage over using a bytecode for this rather than using source code. Recall that CTFE is an interpreter. (It has some efficiency problems, but that is not related to the file format.) There is no technical reason why tokenized and compressed D source code cannot be interpreted and effectively serve the role of "bytecode". I'll come out and say that bytecode is probably the biggest software misfeature anyone ever set store by :-) Wow, I think that's exactly what we could use! It serves multiple optional use cases all at once! Was there a technical reason for you not getting around to implementing it, or just a lack of time? There always seemed something more important to be doing, and Andrei thought it would be better to put such a capability in rdmd rather than dmd.
Re: Compilation strategy
On Tuesday, 18 December 2012 at 00:42:13 UTC, Walter Bright wrote: On 12/17/2012 3:03 PM, deadalnix wrote: I know that. I'm not arguing against that. I'm arguing against the fact that this is a blocker. This is a blocker in very few use cases, in fact. I just look at the whole picture here. People needing that are the exception, not the rule. I'm not sure what you mean. A blocker for what? And what prevents us from using a bytecode that loses information? I'd turn that around and ask why have a bytecode? Because it is CTFEable efficiently, without requiring either to recompile the source code or even distribute the source code. As long as it is CTFEable, most people will be happy. CTFE needs the type information and AST trees and symbol table. Everything needed for decompilation. You do not need more information than what is in a .di file. Java and C# put more info in than that because of runtime reflection (and still, there are tools to strip most of it; no type info, granted, but everything else), something we don't need.
Re: Compilation strategy
On 12/17/2012 3:03 PM, deadalnix wrote: I know that. I'm not arguing against that. I'm arguing against the fact that this is a blocker. This is a blocker in very few use cases, in fact. I just look at the whole picture here. People needing that are the exception, not the rule. I'm not sure what you mean. A blocker for what? And what prevents us from using a bytecode that loses information? I'd turn that around and ask why have a bytecode? As long as it is CTFEable, most people will be happy. CTFE needs the type information and AST trees and symbol table. Everything needed for decompilation. I know that bytecode has been around since 1995 in its current incarnation, and there's an ingrained assumption that since there's such an extensive ecosystem around it, that there is some advantage to it. But there isn't.
Re: Compilation strategy
On Tue, Dec 18, 2012 at 02:08:55AM +0400, Dmitry Olshansky wrote: [...] > I suspect it's one of prime examples where UNIX philosophy of > combining a bunch of simple (~ dumb) programs together in place of > one more complex program was taken *far* beyond reasonable lengths. > > Having a pipe-line: > preprocessor -> compiler -> (still?) assembler -> linker > > where every program tries hard to know nothing about the previous > ones (and be as simple as possibly can be) is bound to get > inadequate results on many fronts: > - efficiency & scalability > - cross-border error reporting and detection (linker errors? errors > for expanded macro magic?) > - cross-file manipulations (e.g. optimization, see _how_ LTO is done in GCC) > - multiple problems from a loss of information across pipeline* The problem is not so much the structure preprocessor -> compiler -> assembler -> linker; the problem is that these logical stages have been arbitrarily assigned to individual processes residing in their own address space, communicating via files (or pipes, whatever it may be). The fact that they are separate processes is in itself not that big of a problem, but the fact that they reside in their own address space is a big problem, because you cannot pass any information down the chain except through rudimentary OS interfaces like files and pipes. Even that wouldn't have been so bad, if it weren't for the fact that user interface (in the form of text input / object file format) has also been conflated with program interface (the compiler has to produce the input to the assembler, in *text*, and the assembler has to produce object files that do not encode any direct dependency information because that's the standard file format the linker expects). Now consider if we keep the same stages, but each stage is not a separate program but a *library*. 
The code then might look, in greatly simplified form, something like this:

    import libdmd.compiler;
    import libdmd.assembler;
    import libdmd.linker;

    void main(string[] args) {
        // typeof(asmCode) is some arbitrarily complex data
        // structure encoding assembly code, inter-module
        // dependencies, etc.
        auto asmCode = compiler.lex(args)
            .parse()
            .optimize()
            .codegen();
        // Note: no stupid redundant convert to string, parse,
        // convert back to internal representation.
        auto objectCode = assembler.assemble(asmCode);
        // Note: linker has direct access to dependency info,
        // etc., carried over from asmCode -> objectCode.
        auto executable = linker.link(objectCode);
        auto output = File(outfile, "w");
        executable.generate(output);
    }

Note that the types asmCode, objectCode, executable, are arbitrarily complex, and may contain lazy-evaluated data structures, references to on-disk temporary storage (for large projects you can't hold everything in RAM), etc. Dependency information in asmCode is propagated to objectCode, as necessary. The linker has full access to all info the compiler has access to, and can perform inter-module optimization, etc., by accessing information available to the *compiler* front-end, not just some crippled object file format. The root of the current nonsense is that perfectly-fine data structures are arbitrarily required to be flattened into some kind of intermediate form, written to some file (or sent down some pipe), often with loss of information, then read from the other end, interpreted, and reconstituted into other data structures (with incomplete info), then processed. In many cases, information that didn't make it through the channel has to be reconstructed (often imperfectly), and then used. Most of these steps are redundant. If the compiler data structures were already directly available in the first place, none of this baroque dance is necessary.
> *Semantic info on interdependency of symbols in a source file is > destroyed right before the linker and thus each .obj file is > included as a whole or not at all. Thus all C run-times I've seen > _sidestep_ this by writing each function in its own file(!). Even > this alone should have been a clear indication. > > While simplicity (and correspondingly size in memory) of programs > was the king in 70's it's well past due. Nowadays I think is all > about getting highest throughput and more powerful features. [...] Simplicity is good. Simplicity lets you modularize a very complex piece of software (a compiler that converts D source code into executables) into manageable chunks. Simplicity does not require shoe-horning modules into separate programs with separate address spaces with separate (and deficient) input/output formats. The problem isn't with simplicity, the problem is with carrying over the archaic mapping of c
Re: Compilation strategy
On Mon, 17 Dec 2012 13:47:36 -0800, Walter Bright wrote: I've often thought Java bytecode was a complete joke. It doesn't deliver any of its promises. You could tokenize Java source code, run the result through an lzw compressor, and get the equivalent functionality in every way. Not true at all. Bytecode is semi-optimized, easier to manipulate (obfuscate, instrument, etc.), JVM/CLR bytecode is shared by many languages (Java, Scala / C#, F#) so you don't need a separate parser for each language, and there is hardware that supports running JVM bytecode on the metal. Try doing the same with lzw'd source code.
Re: Compilation strategy
On 17.12.2012 23:23, Walter Bright wrote: On 12/17/2012 2:08 PM, Dmitry Olshansky wrote: I really loved the way Turbo Pascal units were made. I wish D went the same route. Object files would then be looked at as minimal and stupid variation of module where symbols are identified by mangling (not plain meta data as (would be) in module) and no source for templates is emitted. +1 I'll bite. How is this superior to D's system? I have never used TP. Just explaining the TP way, not doing comparisons. Each unit (module) is a single file and contains all declarations; there is a separation between the public and implementation parts. Multiple units can be circularly dependent, as long as they depend on each other only through the implementation part. The compiler and IDE are able to extract all the necessary information from a unit file, thus making a single file all that is required for making the compiler happy and avoiding synchronization errors. Like any language using modules, the compiler is pretty fast and uses an included linker optimized for the information stored in the units. Besides the IDE, there are command line utilities that dump the public information of a given unit, as a way for programmers to read the available exported API. Basically not much different from what Java and .NET do, but with a language that by default uses native compilation tooling. -- Paulo
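The "dump the public information of a unit" utility Paulo mentions is easy to picture. A hypothetical sketch in Python, assuming a Turbo Pascal-style source layout with interface/implementation markers (an illustration of the idea only, not TP's actual binary unit format):

```python
def public_api(unit_text):
    """Return only the interface section of a TP-style unit:
    everything between the 'interface' and 'implementation' markers."""
    out, in_iface = [], False
    for line in unit_text.splitlines():
        word = line.strip().lower()
        if word == "interface":
            in_iface = True
            continue
        if word == "implementation":
            break  # everything past here is private
        if in_iface:
            out.append(line)
    return "\n".join(out).strip()
```

The point of the design is that the same file carries both halves, so the "exported API" view can never drift out of sync with the implementation the way a hand-maintained header can.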
Re: Compilation strategy
On Monday, 17 December 2012 at 22:12:01 UTC, Walter Bright wrote: On 12/17/2012 1:53 PM, Rob T wrote: I mentioned in a previous post that we should perhaps focus on making the .di concept more efficient rather than focus on obfuscation. We're not going to do obfuscation, because as I explained such cannot work, and we shouldn't do a disservice to users by pretending it does. There are many ways that *do* work, such as PIMPL, which work today and should be used by any organization wishing to obfuscate their implementation. I agree. I suspect most file transport protocols already compress the data, so compressing it ourselves probably accomplishes nothing. There are also compressed filesystems, so storing files in a compressed manner likely accomplishes little. Yes, however my understanding is that HTTP-based file transfers are often not compressed despite the protocol specifically supporting the feature. The problem is not with the protocol; it's that some clients and servers simply do not implement the feature or are misconfigured. HTTP, as you know, is very widely used for transferring files. Another thing to consider is using byte code for interpretation; that way D could be used directly in game engines in place of Lua or other scripting methods, or even as a replacement for JavaScript. Of course you know best if this is practical for a language like D, but maybe a subset of D is practical, I don't know. I have toyed with the idea many times, however, of having dmd support zip files. Zip files can contain an arbitrary file hierarchy, with individual files in compressed, encrypted, or plaintext at the selection of the zip builder. An entire project, or library, or collection of source modules can be distributed as a zip file, and could be compiled with nothing more than:

    dmd foo.zip

or:

    dmd myfile ThirdPartyLib.zip

and have it work. The advantage here is simply that everything can be contained in one simple file. The concept is simple.
The files in the zip file simply replace the zip file in the command line. So, if foo.zip contains a.d, b/c.d, and d.obj, then:

    dmd xx foo.zip

is equivalent to:

    unzip foo
    dmd xx a.d b/c.d d.obj

P.S. I've also wanted to use .zip files as the .lib file format (!), as the various .lib formats have nothing over .zip files. Wow, I think that's exactly what we could use! It serves multiple optional use cases all at once! Was there a technical reason for you not getting around to implementing it, or just a lack of time? --rt
Re: Compilation strategy
On Monday, 17 December 2012 at 21:36:46 UTC, Walter Bright wrote: On 12/17/2012 12:49 PM, deadalnix wrote: Granted, this is still easier than assembly, but you neglected the fact that Java is rather simple, where D isn't. It is unlikely that an optimized D bytecode can ever be decompiled in a satisfying way. Please listen to me. You have FULL TYPE INFORMATION in the Java bytecode. That is not true for Scala or Clojure. Java bytecode doesn't allow expressing closures and similar concepts. Decompiled Scala bytecode is frankly hard to understand. Java bytecode is nice for decompiling Java, nothing else. You have ZERO, ZERO, ZERO type information in object code. (Well, you might be able to extract some from mangled global symbol names, for C++ and D (not C), if they haven't been stripped.) Do not underestimate what the loss of ALL the type information means to be able to do meaningful decompilation. Please understand that I actually do know what I'm talking about with this stuff. I have written a Java compiler. I know what it emits. I know what's in Java bytecode, and how it is TRIVIALLY reversed back into Java source. I know that. I'm not arguing against that. I'm arguing against the fact that this is a blocker. This is a blocker in very few use cases, in fact. I just look at the whole picture here. People needing that are the exception, not the rule. And what prevents us from using a bytecode that loses information? As long as it is CTFEable, most people will be happy.
Re: Compilation strategy
On Monday, 17 December 2012 at 22:12:01 UTC, Walter Bright wrote: On 12/17/2012 1:53 PM, Rob T wrote: I mentioned in a previous post that we should perhaps focus on making the .di concept more efficient rather than focus on obfuscation. We're not going to do obfuscation, because as I explained such cannot work, and we shouldn't do a disservice to users by pretending it does. There are many ways that *do* work, such as PIMPL, which work today and should be used by any organization wishing to obfuscate their implementation. Shorter file sizes is a potential use case, and you could even allow a distributor of byte code to optionally supply in compressed form that is automatically uncompressed when compiling, although as a trade off that would add on a small compilation performance hit. I suspect most file transport protocols already compress the data, so compressing it ourselves probably accomplishes nothing. There are also compressed filesystems, so storing files in a compressed manner likely accomplishes little. I have toyed with the idea many times, however, of having dmd support zip files. Zip files can contain an arbitrary file hierarchy, with individual files in compressed, encrypted, or plaintext at the selection of the zip builder. An entire project, or library, or collection of source modules can be distributed as a zip file, and could be compiled with nothing more than:

    dmd foo.zip

or:

    dmd myfile ThirdPartyLib.zip

and have it work. The advantage here is simply that everything can be contained in one simple file. The concept is simple. The files in the zip file simply replace the zip file in the command line. So, if foo.zip contains a.d, b/c.d, and d.obj, then:

    dmd xx foo.zip

is equivalent to:

    unzip foo
    dmd xx a.d b/c.d d.obj

P.S. I've also wanted to use .zip files as the .lib file format (!), as the various .lib formats have nothing over .zip files. Yes please.
This is successfully used in Java: JAR files are actually zip files that can contain source code, binary files, resources, documentation, package meta-data, etc. Such an archive can store multiple binaries to support multiple architectures (as on OS X). .NET assemblies accomplish similar goals (although I don't know whether they are zip archives internally as well). D could support both "legacy" C/C++-compatible formats (lib/obj/headers) and this new dlib format.
Re: Compilation strategy
On 12/17/2012 2:08 PM, Dmitry Olshansky wrote: I really loved the way Turbo Pascal units were made. I wish D went the same route. Object files would then be looked at as minimal and stupid variation of module where symbols are identified by mangling (not plain meta data as (would be) in module) and no source for templates is emitted. +1 I'll bite. How is this superior to D's system? I have never used TP. *Semantic info on interdependency of symbols in a source file is destroyed right before the linker and thus each .obj file is included as a whole or not at all. Thus all C run-times I've seen _sidestep_ this by writing each function in its own file(!). Even this alone should have been a clear indication. This is done using COMDATs in C++ and D today.
Re: Compilation strategy
Walter Bright: dmd foo.zip or: dmd myfile ThirdPartyLib.zip and have it work. The advantage here is simply that everything can be contained in one simple file. This was discussed a long time ago (even using a "rock" suffix for those zip files) and it seems a nice idea. Bye, bearophile
Re: Compilation strategy
On 12/17/12, Walter Bright wrote: > I have toyed with the idea many times, however, of having dmd support zip > files. I think such a feature is better suited for RDMD. Then many other compilers would benefit, since RDMD can be used with GDC and LDC.
Re: Compilation strategy
On 12/17/2012 1:53 PM, Rob T wrote: I mentioned in a previous post that we should perhaps focus on making the .di concept more efficient rather than focus on obfuscation. We're not going to do obfuscation, because as I explained such cannot work, and we shouldn't do a disservice to users by pretending it does. There are many ways that *do* work, such as PIMPL, which work today and should be used by any organization wishing to obfuscate their implementation. Shorter file sizes is a potential use case, and you could even allow a distributor of byte code to optionally supply in compressed form that is automatically uncompressed when compiling, although as a trade off that would add on a small compilation performance hit. I suspect most file transport protocols already compress the data, so compressing it ourselves probably accomplishes nothing. There are also compressed filesystems, so storing files in a compressed manner likely accomplishes little. I have toyed with the idea many times, however, of having dmd support zip files. Zip files can contain an arbitrary file hierarchy, with individual files in compressed, encrypted, or plaintext at the selection of the zip builder. An entire project, or library, or collection of source modules can be distributed as a zip file, and could be compiled with nothing more than:

    dmd foo.zip

or:

    dmd myfile ThirdPartyLib.zip

and have it work. The advantage here is simply that everything can be contained in one simple file. The concept is simple. The files in the zip file simply replace the zip file in the command line. So, if foo.zip contains a.d, b/c.d, and d.obj, then:

    dmd xx foo.zip

is equivalent to:

    unzip foo
    dmd xx a.d b/c.d d.obj

P.S. I've also wanted to use .zip files as the .lib file format (!), as the various .lib formats have nothing over .zip files.
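Walter's expansion rule ("the files in the zip file simply replace the zip file in the command line") is mechanical enough to prototype outside the compiler. A sketch of such a driver in Python; the function name and the temp-dir policy are assumptions, not a real dmd feature:

```python
import tempfile
import zipfile
from pathlib import Path

def expand_args(args):
    """Replace each .zip argument with the paths of its extracted
    members, leaving every other argument untouched, so that
    `dmd xx foo.zip` behaves like `unzip foo; dmd xx a.d b/c.d d.obj`."""
    out = []
    for arg in args:
        if not arg.endswith(".zip"):
            out.append(arg)
            continue
        dest = Path(tempfile.mkdtemp(prefix="dzip-"))
        with zipfile.ZipFile(arg) as zf:
            zf.extractall(dest)
            out.extend(str(dest / name) for name in zf.namelist())
    return out

# A wrapper (rdmd, say) could then do:
#   subprocess.run(["dmd"] + expand_args(sys.argv[1:]))
```

This is also a point in favor of doing it in rdmd rather than dmd, as suggested later in the thread: the compiler never needs to know the archive existed.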
Re: Compilation strategy
On 12/18/2012 12:34 AM, Paulo Pinto wrote: On 17.12.2012 21:09, foobar wrote: On Monday, 17 December 2012 at 04:49:46 UTC, Michel Fortin wrote: On 2012-12-17 03:18:45, Walter Bright said: Whether the file format is text or binary does not make any fundamental difference. I too expect the difference in performance to be negligible in binary form if you maintain the same structure. But if you're translating it to another format you can improve the structure to make it faster. If the file had a table of contents (TOC) of publicly visible symbols right at the start, you could read that table of contents alone to fill symbol tables while lazy-loading symbol definitions from the file only when needed. Often, most of the file beyond the TOC wouldn't be needed at all. Having to parse and construct the syntax tree for the whole file incurs many memory allocations in the compiler, which you could avoid if the file was structured for lazy-loading. With a TOC you have very little to read from disk and very little to allocate in memory and that'll make compilation faster. More importantly, if you use only fully-qualified symbol names in the translated form, then you'll be able to lazily load privately imported modules because they'll only be needed when you need the actual definition of a symbol. (Template instantiation might require loading privately imported modules too.) And then you could structure it so a whole library could fit in one file, putting all the TOCs at the start of the same file so it loads from disk in a single read operation (or a couple of *sequential* reads). I'm not sure of the speedup all this would provide, but I'd hazard a guess that it wouldn't be so negligible when compiling a large project incrementally. Implementing any of this in the current front end would be a *lot* of work however. Precisely. That is the correct solution and is also how [turbo?] pascal units (==libs) were implemented *decades ago*.
I'd like to also emphasize the importance of using a *single* encapsulated file. This prevents synchronization hazards that D inherited from the broken C/C++ model. I really loved the way Turbo Pascal units were made. I wish D went the same route. Object files would then be looked at as minimal and stupid variation of module where symbols are identified by mangling (not plain meta data as (would be) in module) and no source for templates is emitted. AFAIK Delphi is able to produce both DCU and OBJ files (and link with them). Dunno what it does with generics (and which kind these are) and how. I really miss it, but at least it has been picked up by Go as well. I still find it strange that many C and C++ developers are unaware that we have had modules since the early 80's. +1 I suspect it's one of the prime examples where the UNIX philosophy of combining a bunch of simple (~ dumb) programs together in place of one more complex program was taken *far* beyond reasonable lengths. Having a pipe-line:

    preprocessor -> compiler -> (still?) assembler -> linker

where every program tries hard to know nothing about the previous ones (and be as simple as it possibly can be) is bound to get inadequate results on many fronts:
- efficiency & scalability
- cross-border error reporting and detection (linker errors? errors for expanded macro magic?)
- cross-file manipulations (e.g. optimization, see _how_ LTO is done in GCC)
- multiple problems from a loss of information across the pipeline*

*Semantic info on interdependency of symbols in a source file is destroyed right before the linker and thus each .obj file is included as a whole or not at all. Thus all C run-times I've seen _sidestep_ this by writing each function in its own file(!). Even this alone should have been a clear indication. While simplicity (and correspondingly size in memory) of programs was the king in the 70's, it's well past due. Nowadays I think it's all about getting the highest throughput and more powerful features. -- Paulo -- Dmitry Olshansky
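Michel Fortin's TOC-first layout quoted above is straightforward to prototype. A minimal sketch, assuming a made-up container format (a magic number, an entry count, name/offset/length records, then the definition blobs); this is an illustration only, not any real D module format:

```python
import io
import struct

MAGIC = b"DMOD"

def write_module(entries):
    """Pack {symbol name: definition blob} into one image with the
    table of contents up front, so a reader can fill its symbol
    table from the TOC alone and fetch definitions lazily."""
    names = sorted(entries)
    toc, body = io.BytesIO(), io.BytesIO()
    for name in names:
        raw = name.encode()
        toc.write(struct.pack("<H", len(raw)) + raw)
        toc.write(struct.pack("<II", body.tell(), len(entries[name])))
        body.write(entries[name])
    return MAGIC + struct.pack("<I", len(names)) + toc.getvalue() + body.getvalue()

def read_toc(data):
    """Parse only the TOC: symbol name -> (offset, length) in the body."""
    assert data[:4] == MAGIC
    (count,) = struct.unpack_from("<I", data, 4)
    pos, toc = 8, {}
    for _ in range(count):
        (nlen,) = struct.unpack_from("<H", data, pos); pos += 2
        name = data[pos:pos + nlen].decode(); pos += nlen
        toc[name] = struct.unpack_from("<II", data, pos); pos += 8
    return toc, pos  # pos marks the start of the body region

def load_symbol(data, toc, body_start, name):
    off, length = toc[name]
    return data[body_start + off : body_start + off + length]
```

A whole library's TOCs could sit at the front of one such file, which is exactly the "single sequential read" property Michel describes.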
Re: Compilation strategy
On Monday, 17 December 2012 at 21:47:36 UTC, Walter Bright wrote: There is no substantive difference between bytecode and source code, as I've been trying to explain. It is not superior in any way (other than being shorter, and hence less costly to transmit over the internet). I mentioned in a previous post that we should perhaps focus on making the .di concept more efficient rather than focus on obfuscation. Shorter file sizes are a potential use case, and you could even allow a distributor of bytecode to optionally supply it in a compressed form that is automatically uncompressed when compiling, although as a trade-off that would add a small compilation performance hit. --rt
Re: Compilation strategy
On 12/17/2012 12:51 PM, deadalnix wrote: On Monday, 17 December 2012 at 09:40:22 UTC, Walter Bright wrote: On 12/17/2012 12:54 AM, deadalnix wrote: More seriously, I understand that in some cases, di are interesting. Mostly if you want to provide a closed source library to be used by 3rd party devs. You're missing another major use - encapsulation and isolation, reducing the dependencies between parts of your system. For such a case, bytecode is a superior solution as it would allow CTFE. There is no substantive difference between bytecode and source code, as I've been trying to explain. It is not superior in any way, (other than being shorter, and hence less costly to transmit over the internet). I've also done precompiled headers for C and C++, which are more or less a binary module importation format. So, I have extensive personal experience with: 1. bytecode modules 2. binary symbolic modules 3. modules as source code I picked (3) for D, based on real experience with other methods of doing it. (3) really is the best solution. I've often thought Java bytecode was a complete joke. It doesn't deliver any of its promises. You could tokenize Java source code, run the result through an lzw compressor, and get the equivalent functionality in every way. And yes, you can do the same with D modules. Tokenize, run through an lzw compressor, and voila! a "binary" module import format that is small, loads fast, and "obfuscated", for whatever little that is worth.
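Walter's "tokenize, then run through a compressor" equivalence is easy to demonstrate. The sketch below is a hypothetical Python illustration: zlib stands in for an LZW compressor (Python's standard library has no LZW codec), and the regex lexer is deliberately crude, nothing like a real D tokenizer.

```python
# A "binary module format" that is nothing more than tokenized source
# run through a general-purpose compressor, as Walter describes.
import re
import zlib

# Crude lexer: identifiers, numbers, two-char operators, any other char.
TOKEN = re.compile(r'[A-Za-z_]\w*|\d+|==|!=|<=|>=|\S')

def tokenize(source: str) -> list[str]:
    """Split source into a flat token stream, dropping whitespace."""
    return TOKEN.findall(source)

def pack(source: str) -> bytes:
    """Tokenize, join with a NUL separator, and compress."""
    return zlib.compress('\x00'.join(tokenize(source)).encode())

def unpack(blob: bytes) -> list[str]:
    """Recover the token stream from the compressed blob."""
    return zlib.decompress(blob).decode().split('\x00')

src = "int toUpper(int c) { return c >= 97 ? c - 32 : c; }"
blob = pack(src)
assert unpack(blob) == tokenize(src)   # lossless round trip
print(len(src.encode()), "->", len(blob), "bytes")
```

As in Walter's description, the result is smaller, loads without re-lexing, and is "obfuscated" only in the sense that whitespace and comments are gone.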
Re: Compilation strategy
On Monday, 17 December 2012 at 12:54:46 UTC, jerro wrote: If we want to allow D to fit into various niche markets overlooked by C++, for added security, encryption could be added, where the person compiling encrypted .di files would have to supply a key. That would work only for certain situations, not for mass distribution, but it may be useful to enough people. I can't imagine a situation where encrypting .di files would make any sense. Such files would be completely useless without the key, so you would have to either distribute the key along with the files or the compiler would need to contain the key. The former obviously makes encryption pointless and you could only make the latter work by attempting to hide the key inside the compiler. The fact that the compiler is open source would make that harder and someone would eventually manage to extract the key in any case. This whole DRM business would also prevent D from ever being added to GCC. Of course open source code would never be encrypted; I was suggesting an entirely optional convenience feature for users of the compiler, not a general method of storing library files or a foolproof method for the mass distribution of hidden content. Having such a feature would allow a company or individual to package up their source code in a way that no one could look at without a specific key. It does not matter if the compiler is open source or not; only a user with the correct key could decrypt the contents, intended or not.
Obviously anyone who has enough skills and the correct key to a specific encrypted package could decrypt the contents of that specific package (and then post it on usenet or bt or in a million+1 other ways). But you would still need access to the key, and you would need access to a tool that decrypts the contents (such as the compiler itself). That's what security is all about: it's simply a set of barriers that make it difficult, but not impossible, to break through. All security systems are breakable, end of story, no debate there, just take a look around you and see for yourself. The difference from packaging in an encrypted archive that is later decrypted and installed for use is that in this case the source code is never left lying around in decrypted form: the source data is decrypted only while it is being compiled, and the decrypted content is immediately and securely discarded afterwards. BTW, for the record I'm no fan of DRM in the general sense, but many companies think they need to lock out prying eyes, and it's not my place to tell them that they should not be worried about it and should fully open up their doors to whatever content they want to distribute. --rt
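The workflow described above, decrypt only in memory while compiling, then discard, can be modeled roughly as below. This is a hypothetical Python sketch; the XOR keystream is a standard-library stand-in for real authenticated encryption and offers no actual protection, which is rather the poster's point about barriers versus guarantees.

```python
# Toy model of an encrypted-.di workflow: the plaintext exists only in
# memory during "compilation", never on disk. NOT real cryptography --
# a hash-based XOR keystream used purely for illustration.
import hashlib

def keystream(key: bytes, n: int) -> bytes:
    """Derive n pseudo-random bytes from the key via chained SHA-256."""
    out = b''
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, 'big')).digest()
        counter += 1
    return out[:n]

def crypt(key: bytes, data: bytes) -> bytes:
    """XOR with a key-derived stream; the same call encrypts and decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

di_source = b"module secret; int hiddenAlgorithm(int x);"
shipped = crypt(b"customer-key", di_source)    # what the vendor distributes
assert shipped != di_source                    # unreadable without the key
recovered = crypt(b"customer-key", shipped)    # decrypted only in memory
assert recovered == di_source                  # then compiled and discarded
```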
Re: Compilation strategy
On 12/17/2012 12:49 PM, deadalnix wrote: Granted, this is still easier than assembly, but you neglected the fact that java is rather simple, where D isn't. It is unlikely that an optimized D bytecode can ever be decompiled in a satisfying way. Please listen to me. You have FULL TYPE INFORMATION in the Java bytecode. You have ZERO, ZERO, ZERO type information in object code. (Well, you might be able to extract some from mangled global symbol names, for C++ and D (not C), if they haven't been stripped.) Do not underestimate what the loss of ALL the type information means to be able to do meaningful decompilation. Please understand that I actually do know what I'm talking about with this stuff. I have written a Java compiler. I know what it emits. I know what's in Java bytecode, and how it is TRIVIALLY reversed back into Java source. The only difference between Java source code and Java bytecode is the latter has local symbol names and comments stripped out. There's a 1:1 correspondence. This is not at all true with object code. (Because .class files have full type information, a Java compiler can "import" either a .java file or a .class file with equal facility.)
Re: Compilation strategy
On Monday, 17 December 2012 at 20:31:08 UTC, Paulo Pinto wrote: But if there was interest, I am sure there could be a way to store the template information in the compiled module, while exposing the required type parameters for the template in the .di file, a la Ada. -- Paulo Maybe the focus should not be on obfuscation directly, but on making the .di packaging system perform better. If library content can be packaged in a more efficient way through the use of D interface files, then at least there's some practical use for it that may one day get implemented. --rt
Re: Compilation strategy
On Monday, 17 December 2012 at 09:40:22 UTC, Walter Bright wrote: On 12/17/2012 12:54 AM, deadalnix wrote: More seriously, I understand that in some cases, di are interesting. Mostly if you want to provide a closed source library to be used by 3rd party devs. You're missing another major use - encapsulation and isolation, reducing the dependencies between parts of your system. For such a case, bytecode is a superior solution as it would allow CTFE. Do you really want to be recompiling the garbage collector for every module you compile? It's not because the gc is closed source that .di files are useful.
Re: Compilation strategy
On Monday, 17 December 2012 at 09:37:48 UTC, Walter Bright wrote: On 12/17/2012 12:55 AM, Paulo Pinto wrote: Assembly is no different than reversing any other type of bytecode: This is simply not true for Java bytecode. About the only thing you lose with Java bytecode are local variable names. Full type information and the variables themselves are intact. It depends on the compiler switches you use. You can strip names, but that obviously impacts reflection capabilities. Also, Java is quite easy to decompile due to the very simple structure of the language, even if in some cases optimization can confuse the decompiler. Try other JVM languages like Clojure, Scala or Groovy: the produced code can hardly be understood except by a specialist. Granted, this is still easier than assembly, but you neglected the fact that Java is rather simple, where D isn't. It is unlikely that optimized D bytecode could ever be decompiled in a satisfying way.
Re: Compilation strategy
On 16.12.2012 23:32, Andrej Mitrovic wrote: On 12/16/12, Paulo Pinto wrote: If modules are used correctly, a .di should be created with the public interface and everything else is already in binary format, thus the compiler is not really parsing everything all the time. A lot of D code tends to be templated code; .di files don't help you in that case. Why not? Ada, Modula-3, Eiffel and the ML languages are just a few examples of languages that support modules and genericity. So clearly there are some ways of doing it. Granted, in the Ada and Modula-3 cases you actually have to define the types when importing a module, so there is already a difference. A second issue is that their generic systems are not as powerful as D's. I think that the main issue is that the majority seems to be OK with having template code in .di files, and that is fine. But if there was interest, I am sure there could be a way to store the template information in the compiled module, while exposing the required type parameters for the template in the .di file, a la Ada. -- Paulo
Re: Compilation strategy
On 17.12.2012 21:09, foobar wrote: On Monday, 17 December 2012 at 04:49:46 UTC, Michel Fortin wrote: On 2012-12-17 03:18:45 +, Walter Bright said: Whether the file format is text or binary does not make any fundamental difference. I too expect the difference in performance to be negligible in binary form if you maintain the same structure. But if you're translating it to another format you can improve the structure to make it faster. If the file had a table of contents (TOC) of publicly visible symbols right at the start, you could read that table of contents alone to fill symbol tables while lazy-loading symbol definitions from the file only when needed. Often, most of the file beyond the TOC wouldn't be needed at all. Having to parse and construct the syntax tree for the whole file incurs many memory allocations in the compiler, which you could avoid if the file was structured for lazy-loading. With a TOC you have very little to read from disk and very little to allocate in memory, and that'll make compilation faster. More importantly, if you use only fully-qualified symbol names in the translated form, then you'll be able to lazily load privately imported modules because they'll only be needed when you need the actual definition of a symbol. (Template instantiation might require loading privately imported modules too.) And then you could structure it so a whole library could fit in one file, putting all the TOCs at the start of the same file so it loads from disk in a single read operation (or a couple of *sequential* reads). I'm not sure of the speedup all this would provide, but I'd hazard a guess that it wouldn't be so negligible when compiling a large project incrementally. Implementing any of this in the current front end would be a *lot* of work, however. Precisely. That is the correct solution and is also how [Turbo?] Pascal units (==libs) were implemented *decades ago*. I'd like to also emphasize the importance of using a *single* encapsulated file. 
This prevents the synchronization hazards that D inherited from the broken C/C++ model. I really miss it, but at least it has been picked up by Go as well. I still find it strange that many C and C++ developers are unaware that we have had modules since the early 80's. -- Paulo
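Michel Fortin's TOC scheme can be sketched in miniature. The container layout below (a length-prefixed TOC mapping symbol names to offsets, followed by the definition bodies) is invented purely for illustration; a real compiler would store parsed declarations rather than raw text.

```python
# Toy binary module: [u32 toc_len][toc "name:offset:len" lines][bodies].
# The TOC is read in one small sequential read; definitions are fetched
# lazily with a seek, so unused symbols never touch the parser.
import io
import struct

def write_module(buf: io.BytesIO, symbols: dict[str, bytes]) -> None:
    """Write all symbol bodies after a TOC mapping name -> (offset, len)."""
    toc_lines, bodies, offset = [], b'', 0
    for name, body in symbols.items():
        toc_lines.append(f"{name}:{offset}:{len(body)}")
        bodies += body
        offset += len(body)
    toc = '\n'.join(toc_lines).encode()
    buf.write(struct.pack('<I', len(toc)) + toc + bodies)

def read_toc(buf: io.BytesIO) -> dict[str, tuple[int, int]]:
    """Read only the TOC; no definition bodies are parsed or allocated."""
    buf.seek(0)
    (toc_len,) = struct.unpack('<I', buf.read(4))
    base = 4 + toc_len
    toc = {}
    for line in buf.read(toc_len).decode().splitlines():
        name, off, ln = line.rsplit(':', 2)
        toc[name] = (base + int(off), int(ln))
    return toc

def load_symbol(buf: io.BytesIO, toc: dict, name: str) -> bytes:
    """Lazily fetch a single definition by seeking straight to it."""
    off, ln = toc[name]
    buf.seek(off)
    return buf.read(ln)

buf = io.BytesIO()
write_module(buf, {"toUpper": b"dchar toUpper(dchar c) {...}",
                   "toLower": b"dchar toLower(dchar c) {...}"})
toc = read_toc(buf)
assert load_symbol(buf, toc, "toLower") == b"dchar toLower(dchar c) {...}"
```

Packing many modules' TOCs at the front of one library file, as suggested above, is the same idea with one more level of indirection.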
Re: Compilation strategy
On Monday, 17 December 2012 at 04:49:46 UTC, Michel Fortin wrote: On 2012-12-17 03:18:45 +, Walter Bright said: Whether the file format is text or binary does not make any fundamental difference. I too expect the difference in performance to be negligible in binary form if you maintain the same structure. But if you're translating it to another format you can improve the structure to make it faster. If the file had a table of contents (TOC) of publicly visible symbols right at the start, you could read that table of contents alone to fill symbol tables while lazy-loading symbol definitions from the file only when needed. Often, most of the file beyond the TOC wouldn't be needed at all. Having to parse and construct the syntax tree for the whole file incurs many memory allocations in the compiler, which you could avoid if the file was structured for lazy-loading. With a TOC you have very little to read from disk and very little to allocate in memory, and that'll make compilation faster. More importantly, if you use only fully-qualified symbol names in the translated form, then you'll be able to lazily load privately imported modules because they'll only be needed when you need the actual definition of a symbol. (Template instantiation might require loading privately imported modules too.) And then you could structure it so a whole library could fit in one file, putting all the TOCs at the start of the same file so it loads from disk in a single read operation (or a couple of *sequential* reads). I'm not sure of the speedup all this would provide, but I'd hazard a guess that it wouldn't be so negligible when compiling a large project incrementally. Implementing any of this in the current front end would be a *lot* of work, however. Precisely. That is the correct solution and is also how [Turbo?] Pascal units (==libs) were implemented *decades ago*. I'd like to also emphasize the importance of using a *single* encapsulated file. 
This prevents the synchronization hazards that D inherited from the broken C/C++ model.
Re: Compilation strategy
On 2012-12-17 16:01, Walter Bright wrote: Yup. I'd be very surprised if they were based on decompiled Windows executables. Not only that, I didn't say decompiling by hand was impossible. I repeatedly said that it can be done by an expert with a lot of patience. But not automatically. Java .class files can be done automatically with free tools. Fair enough. -- /Jacob Carlborg
Re: Compilation strategy
On 12/17/2012 3:02 AM, mist wrote: AFAIK those are more like Windows API & ABI reverse engineered and reimplemented and that is a huge difference. Yup. I'd be very surprised if they were based on decompiled Windows executables. Not only that, I didn't say decompiling by hand was impossible. I repeatedly said that it can be done by an expert with a lot of patience. But not automatically. Java .class files can be done automatically with free tools.
Re: Compilation strategy
On 12/17/2012 4:38 AM, Jacob Carlborg wrote:> On 2012-12-17 10:58, Walter Bright wrote: > >> Google "convert object file to C" > > A few seconds on Google resulted in this: > > http://www.hex-rays.com/products/decompiler/index.shtml > hex-rays is an interactive tool. Its "decompile" produces things like this:

v81 = 9;
v63 = *(_DWORD *)(v62 + 88);
if ( v63 )
{
    v64 = *(int (__cdecl **)(_DWORD, _DWORD, _DWORD, _DWORD, _DWORD))(v63 + 24);
    if ( v64 )
        v62 = v64(v62, v1, *(_DWORD *)(v3 + 16), *(_DWORD *)(v3 + 40), bstrString);
}

It has wired in some recognition of patterns that call standard functions like strcpy and strlen, but as far as I can see not much else. It's interactive in that you have to supply the interpretation. It's pretty simple to decompile asm by hand into the above, but the work is only just beginning. For example, what is v3+16? Some struct member? Note that there is no type information in the hex-rays output.
Re: Compilation strategy
On 2012-12-17 12:20, eles wrote: I don't know about such frameworks, but the idea is that these kinds of files should be handled by the compiler, not by the operating system. They are not meant to be applications, but libraries. They are handled by the compiler. GCC has the -framework flag. https://developer.apple.com/library/mac/#documentation/MacOSX/Conceptual/BPFrameworks/Concepts/WhatAreFrameworks.html The Finder also knows about these frameworks and bundles and treats them as a single file. -- /Jacob Carlborg
Re: Compilation strategy
It's not as if phobos would be distributed that way. And even if it was, then there'd be an uproar and a fork of the project. I don't think that the FSF would be too happy about adding a front end with DRM support to GCC, even if no encrypted libraries would be added along with it. Of course a fork without DRM support could still be added to GCC, but if support for DRM libraries became part of D, then this would cause problems when some people chose to actually use this feature with their closed source libraries and those couldn't be used with GDC.
Re: Compilation strategy
On 17 December 2012 12:54, jerro wrote: > If we want to allow D to fit into various niche markets overlooked by C++, >> for added security, encryption could be added, where the person compiling >> encrypted .di files would have to supply a key. That would work only for >> certain situations, not for mass distribution, but it may be useful to >> enough people. >> > > I can't imagine a situation where encrypting .di files would make any > sense. Such files would be completely useless without the key, so you would > have to either distribute the key along with the files or the compiler > would need to contain the key. The former obviously makes encryption > pointless and you could only make the latter work by attempting to hide the > key inside the compiler. The fact that the compiler is open source would > make that harder and someone would eventually manage to extract the key in > any case. This whole DRM business would also prevent D from ever being > added to GCC. > It's not as if phobos would be distributed that way. And even if it was, then there'd be an uproar and a fork of the project. -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Re: Compilation strategy
If we want to allow D to fit into various niche markets overlooked by C++, for added security, encryption could be added, where the person compiling encrypted .di files would have to supply a key. That would work only for certain situations, not for mass distribution, but it may be useful to enough people. I can't imagine a situation where encrypting .di files would make any sense. Such files would be completely useless without the key, so you would have to either distribute the key along with the files or the compiler would need to contain the key. The former obviously makes encryption pointless and you could only make the latter work by attempting to hide the key inside the compiler. The fact that the compiler is open source would make that harder and someone would eventually manage to extract the key in any case. This whole DRM business would also prevent D from ever being added to GCC.
Re: Compilation strategy
On 2012-12-17 10:58, Walter Bright wrote: Google "convert object file to C" A few seconds on Google resulted in this: http://www.hex-rays.com/products/decompiler/index.shtml -- /Jacob Carlborg
Re: Compilation strategy
On Monday, 17 December 2012 at 09:58:28 UTC, Walter Bright wrote: On 12/17/2012 1:35 AM, Paulo Pinto wrote: It suffices to get the general algorithm behind the code, and that is impossible to hide, unless the developer resorts to cryptography. I'll say again, with enough effort, an expert *can* decompile object files by hand. You can't make a tool to do that for you, though. It can also be pretty damned challenging to figure out the algorithm used in a bit of non-trivial assembler after it's gone through a modern compiler optimizer. I know nobody here wants to believe me, but it is trivial to automatically turn Java bytecode back into source code. Google "convert .class file to .java": http://java.decompiler.free.fr/ Now try: Google "convert object file to C" If you don't believe me, a guy who's been working on C compilers for 30 years, and who also wrote a Java compiler, that should be a helpful data point. Of course I believe you and respect your experience. The point I was trying to make is that if someone really wants your code, they will get it, even if that means reading assembly instructions manually. At one company I used to work for, we rewrote the TCL parser to read encrypted files to avoid delivering text to the customer, hoping that it would be enough to deter most people. -- Paulo
Re: Compilation strategy
Sounds a lot like frameworks and other types of bundles on Mac OS X. A framework is a folder, with the .framework extension, containing a dynamic library, header files and all other necessary resource files like images and so on. I don't know about such frameworks, but the idea is that these kinds of files should be handled by the compiler, not by the operating system. They are not meant to be applications, but libraries.
Re: Compilation strategy
AFAIK those are more like the Windows API & ABI reverse engineered and reimplemented, and that is a huge difference. On Monday, 17 December 2012 at 10:01:35 UTC, Jacob Carlborg wrote: On 2012-12-17 09:21, Walter Bright wrote: I know what I'm talking about with this. The only time they get reverse engineered is when somebody really really REALLY wants to do it, an expert is necessary to do the job, and it's completely impractical for larger sets of files. You cannot build a tool to do it, it must be done by hand, line by line. It's the proverbial turning of hamburger back into a cow. Ever heard of Wine or ReactOS? It's basically Windows reverse engineered.
Re: Compilation strategy
On 2012-12-17 10:13, eles wrote: WRT all the opinions above (i.e. binary vs text, what to put in, etc.) I had some reflections on that some time ago: how about bundling a "header" file (that would be the .di file) and a binary file (the compiled .d file, that is the .obj file) into a single .zip (albeit with another extension) that will be recognized and processed by the D compiler (let's name that file a .dobj). The idea may seem a bit crazy, but consider the following:
- the standard .zip format could be used by a user of that object/library to learn the interface of the functions provided by the object (just like a C header file)
- if he's a power user, he can simply extract the .zip/.dobj, modify the included header (adding comments, for example), then archive it back and present the compiler a "fresh" .dobj/library file
Sounds a lot like frameworks and other types of bundles on Mac OS X. A framework is a folder, with the .framework extension, containing a dynamic library, header files and all other necessary resource files like images and so on. The responsibility of maintaining the .obj and the header in sync would lie with the compiler, or with the power user if the latter edits it manually. More, IDEs could simply extract the relevant header information from the .zip archive and use it for code completion, documentation and so on. Basically, this would be like bundling a .h file with the corresponding .obj file (if we speak C++), all that under a transparent format. The code is hidden and obfuscated, just like in a standard library (think -lstdc++ vs ). The use of a single file greatly facilitates synchronization, while the use of the standard .zip format allows a plethora of tools to manually tune the file (if desired). This can be extended also to entire .dlibs (that is, archives of .dobjs), which can become self-documenting that way. I kind of dreamt about that since programming in C++, always needing to have the headers and the libs with me. Why not include the headers in the lib, in a transparent and manually readable/editable format? A checksum could also guarantee that the header information and the binary information are in sync inside the .zip archive. What do you think? In general I think it's better to have a package manager handle this. -- /Jacob Carlborg
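The .dobj idea is straightforward to prototype on top of an ordinary zip. The sketch below bundles a header and an object blob together with the checksum eles mentions, so a tool can verify the two halves are still in sync; the file names and manifest format are invented for the example.

```python
# A ".dobj" as a plain zip: header + object code side by side, plus a
# manifest of SHA-256 checksums so header/binary drift can be detected.
import hashlib
import io
import zipfile

def make_dobj(di_text: bytes, obj_bytes: bytes) -> bytes:
    """Bundle header and object code into one zip with a checksum manifest."""
    manifest = (f"di:{hashlib.sha256(di_text).hexdigest()}\n"
                f"obj:{hashlib.sha256(obj_bytes).hexdigest()}\n")
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as z:
        z.writestr('module.di', di_text)
        z.writestr('module.obj', obj_bytes)
        z.writestr('MANIFEST', manifest)
    return buf.getvalue()

def in_sync(dobj: bytes) -> bool:
    """Check that the stored checksums still match the stored contents."""
    with zipfile.ZipFile(io.BytesIO(dobj)) as z:
        stored = dict(line.split(':', 1)
                      for line in z.read('MANIFEST').decode().split())
        return (stored['di'] == hashlib.sha256(z.read('module.di')).hexdigest()
                and stored['obj'] == hashlib.sha256(z.read('module.obj')).hexdigest())

bundle = make_dobj(b"int foo();", b"\x55\x89\xe5")
assert in_sync(bundle)
```

Because it is a standard zip, the power-user workflow from the post (extract, annotate the header, re-archive) works with any archiver, and re-running the checksum tool flags whether the edit broke synchronization.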
Re: Compilation strategy
On 17 December 2012 09:29, Walter Bright wrote: > On 12/17/2012 1:15 AM, Paulo Pinto wrote: > >> http://www.hopperapp.com/ >> >> I really like the way it generates pseudo-code and basic block graphs out >> of >> instruction sequences. >> > > I looked at their examples. Sorry, that's just step one of reverse > engineering an object file. It's a loong way from turning it into > source code. > > For example, consider an optimizer that puts variables int x, class c, and > pointer p all in register EBX. Figure that one out programmatically. Or the > result of a CTFE calculation. Or a template after it's been expanded and > inlined. > Right, there is practically zero chance of being able to come up with 100% identical D code from an object dump / assembly code. Possibly with the exception of a few *very* simple cases (hello world!). However it looks like you might just be able to decode it into a bastardised C version. I can't see hopperapp being very practical beyond small stuff, though... Regards, -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Re: Compilation strategy
On 2012-12-17 09:21, Walter Bright wrote: I know what I'm talking about with this. The only time they get reverse engineered is when somebody really really REALLY wants to do it, an expert is necessary to do the job, and it's completely impractical for larger sets of files. You cannot build a tool to do it, it must be done by hand, line by line. It's the proverbial turning of hamburger back into a cow. Ever heard of Wine or ReactOS? It's basically Windows reverse engineered. -- /Jacob Carlborg
Re: Compilation strategy
On 2012-12-17 09:19, deadalnix wrote: I can't stop myself laughing at people that may think any business can be based on Java, PHP or C#. That is a mere dream ! Such technology will simply never get used in companies, because bytecode can be decoded ! Yet there are a lot of businesses that are based on these languages. -- /Jacob Carlborg
Re: Compilation strategy
On 12/17/2012 1:35 AM, Paulo Pinto wrote: It suffices to get the general algorithm behind the code, and that is impossible to hide, unless the developer resorts to cryptography. I'll say again, with enough effort, an expert *can* decompile object files by hand. You can't make a tool to do that for you, though. It can also be pretty damned challenging to figure out the algorithm used in a bit of non-trivial assembler after it's gone through a modern compiler optimizer. I know nobody here wants to believe me, but it is trivial to automatically turn Java bytecode back into source code. Google "convert .class file to .java": http://java.decompiler.free.fr/ Now try: Google "convert object file to C" If you don't believe me, a guy who's been working on C compilers for 30 years, and who also wrote a Java compiler, that should be a helpful data point.
Re: Compilation strategy
On 12/17/2012 1:45 AM, Paulo Pinto wrote: Pencil and paper? Yes, as I wrote, object files can be reverse engineered, instruction by instruction, by an expert with pencil and paper. You can't make a tool to do it automatically. You *can* make such a tool for Java bytecode files, and such free tools appeared right after Java was initially released.