Re: Compilation strategy
12/18/2012 9:15 PM, Walter Bright wrote: On 12/18/2012 8:48 AM, Dmitry Olshansky wrote: After dropping debug info I can't make heads or tails of what's in the exe yet, but it _seems_ not to include all of the unused code. Gotta investigate on a smaller sample. Generate a linker .map file (-map to dmd). That'll tell you what's in it. It's rather enlightening, especially after running ddemangle over it. Still, it tells only half the story - what symbols are there (and a lot of them shouldn't have been) - now the most important part to figure out is _why_. Given that almost everything is templates and not instantiated (and thus, thank god, not present), still quite a few templates and certain normal functions made it in without ever being called. I'm sure they are not called, because I just imported the module. Adding trace prints to the functions in question shows nothing on screen. I tried running the linker with -xref and I see that the stuff I don't expect to land in the .exe looks either like this:

Symbol  Defined  Referenced
immutable(unicode_tables.SetEntry!(ushort).SetEntry)  unicode_tables.unicodeCased  unicode_tables

(meaning that it's not referenced anywhere yet is present - but I guess unreferenced global data is not stripped away), or, for functions:

dchar uni.toUpper(dchar)  uni  uni
const(@trusted dchar function(uint)) uni.Grapheme.opIndex  uni  uni
...

meaning that they are defined & referenced in the same module only (not the one with the empty main). Yet they're getting pulled in... I'm certain that at least toUpper is not called anywhere in the empty module (nor in module ctors, as I have none). Can you recommend any steps to see the web of symbols that eventually pulls them in? Peeking at the dependency chain (not file-grained but symbol-grained) in any form would be awesome. -- Dmitry Olshansky
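To sketch the kind of symbol-grained dependency peeking asked for here: if the linker's cross-reference output were parsed into an edge list, a simple graph search from the entry point would split the symbols into those actually reachable and those merely dragged in. The edge data and symbol names below are invented for illustration; this is not a parser for optlink's actual -xref format.

```python
from collections import deque

# Hypothetical reference edges, as one might scrape from a linker xref
# listing: symbol -> symbols it references. Names are made up.
refs = {
    "main":                 ["uni.toUpper"],
    "uni.toUpper":          ["uni.toLowerIndex"],
    "uni.toLowerIndex":     [],
    "uni.Grapheme.opIndex": ["uni.toUpper"],  # referenced only within uni
}

def reachable(roots, refs):
    """BFS over the symbol-reference graph from the given roots."""
    seen = set(roots)
    work = deque(roots)
    while work:
        for dep in refs.get(work.popleft(), ()):
            if dep not in seen:
                seen.add(dep)
                work.append(dep)
    return seen

live = reachable(["main"], refs)
dead = set(refs) - live   # candidates the linker could have stripped
```

On this toy data the search marks uni.Grapheme.opIndex as dead, which is exactly the "defined & referenced in the same module only" situation described above.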
Re: Compilation strategy
On Wednesday, 19 December 2012 at 17:17:34 UTC, Dmitry Olshansky wrote: 12/19/2012 1:33 AM, Walter Bright wrote: On 12/18/2012 11:58 AM, Dmitry Olshansky wrote: The same bytecode then could be used for external representation. Sigh, there is (again) no point to an external bytecode. BTW, in the end I think I was convinced that bytecode won't buy D much, especially considering the cost of maintaining a separate spec for it and making sure both are in sync. This argument is bogus. One of the goals of bytecode formats is to provide a common representation for many languages. That is one of the stated goals for MS' CIL and LLVM. So while it's definitely true that maintaining an *additional* format and spec adds considerable costs, that doesn't apply when *reusing an already existing* format, not to mention the benefits of interoperability with other supported languages and platforms. Consider calling Java libraries from JRuby, using C# code in F# projects, etc. Say I want to use both Haskell and D in the same project - how would I do it? Using LLVM I should be able to: both GHC and LDC are based on LLVM.
Re: Compilation strategy
12/19/2012 12:15 AM, Paulo Pinto wrote: On 18.12.2012 21:09, Dmitry Olshansky wrote: 12/19/2012 12:01 AM, Jacob Carlborg wrote: On 2012-12-18 17:48, Dmitry Olshansky wrote: I'm using objconv by Agner Fog as I haven't got dumpobj (guess I'll buy it if need be). dumpobj is included in the DMD release, at least on Mac OS X. And Linux has it. Guess Windows sucks ... dumpbin Only COFF I guess ;) -- Dmitry Olshansky
Re: Compilation strategy
12/19/2012 1:33 AM, Walter Bright wrote: On 12/18/2012 11:58 AM, Dmitry Olshansky wrote: The same bytecode then could be used for external representation. Sigh, there is (again) no point to an external bytecode. BTW, in the end I think I was convinced that bytecode won't buy D much, especially considering the cost of maintaining a separate spec for it and making sure both are in sync. -- Dmitry Olshansky
Re: Compilation strategy
On Tuesday, 18 December 2012 at 17:30:41 UTC, Walter Bright wrote: If I was doing it, and speed was paramount, I'd probably fix it to generate native code instead of bytecode and so execute code directly. Even simple JITs dramatically sped up the early Java VMs. Could you re-use the compiler recursively to first compile and run CTFE, followed by the rest? --rt
Re: Compilation strategy
On 12/18/2012 11:26 AM, H. S. Teoh wrote: Um, it does introduce major support costs for porting to different CPU targets. [...] Could you elaborate? Sure. You have to rewrite it when going from 32 to 64 bit code, or to ARM, or to any other processor. It's not the same as the regular code generator.
Re: Compilation strategy
On 12/18/2012 11:58 AM, Dmitry Olshansky wrote: The same bytecode then could be used for external representation. Sigh, there is (again) no point to an external bytecode.
Re: Compilation strategy
On 12/18/2012 11:23 AM, H. S. Teoh wrote: On Tue, Dec 18, 2012 at 09:55:57AM -0800, Walter Bright wrote: On 12/18/2012 9:42 AM, H. S. Teoh wrote: I was thinking more along the lines of things like fully automatic purity, safety, exception inference. For example, every function body eventually has to be processed by the compiler, so if a particular function is inferred to throw exception X, for example, then when its callers are compiled, this fact can be propagated to them. To do this for the whole program might be infeasible due to the sheer size of things, but if a module contains, for each function exposed by the API, a list of all thrown exceptions, then when the module is imported this information is available up-front and can be propagated further up the call chain. Same thing goes for purity and @safe. This may even allow us to make pure/@safe/nothrow fully automated so that you don't have to explicitly state them (except when you want the compiler to verify that what you wrote is actually pure, safe, etc.). The trouble with this is the separate compilation model. If the attributes are not in the function signature, then the function implementation can change without recompiling the user of that function. Changing the inferred attributes then will subtly break your build. And here's a reason for using an intermediate format (whether it's bytecode or just plain serialized AST or something else is irrelevant). Say we put the precompiled module in a zip file of some sort. If the function attributes change, so does the zip file. So if proper make dependencies are set up, this will automatically trigger the recompilation of whoever uses the module. Relying on a makefile being correct does not solve it. Inferred attributes only work when the implementation source is guaranteed to be available, such as with template functions. Having a binary format doesn't change this. Actually, this doesn't depend on the format being binary. 
You can save everything in plain text format and it will still work. In fact, there might be reasons to want a text format instead of binary, since then one could look at the compiler output to find out what the inferred attributes of a particular declaration are without needing to add compiler querying features. The "plain text format" that works is called D source code :-)
Re: Compilation strategy
On 12/18/2012 08:26 PM, H. S. Teoh wrote: On Tue, Dec 18, 2012 at 10:06:43AM -0800, Walter Bright wrote: On 12/18/2012 9:49 AM, H. S. Teoh wrote: Is it too late to change CTFE to work via native code? No, because doing so involves zero language changes. It is purely a quality-of-implementation issue. Well, that much is obvious; I was just wondering if the current implementation will require too much effort to make it work with native code CTFE. Besides the effort required to rework the existing code (and perhaps the cross-compiling issue, though I don't see it as a major issue), Um, it does introduce major support costs for porting to different CPU targets. [...] Could you elaborate? In my mind, there's not much additional cost to what's already involved in targeting a particular CPU in the first place. Since the CPU is already targeted, we generate native code for it and run that during CTFE. ... The generated native code would need to be different in order to support proper error reporting and dependency handling. (The generated code must be able to communicate with the analyzer/jit compiler.) The compiler does not have the full picture during analysis. It needs to figure out what information a CTFE call depends on. The only way that works in general is running it. E.g.:

string good() {
    mixin(foo(0, () => good())); // ok, delegate never called
}
string bad() {
    mixin(foo(1, () => bad())); // error, need body of bad to generate body of bad
}
string foo(bool b, string delegate() dg) {
    if (b) return dg();
    return q{return "return 0;";};
}
Re: Compilation strategy
On 18.12.2012 21:09, Dmitry Olshansky wrote: 12/19/2012 12:01 AM, Jacob Carlborg wrote: On 2012-12-18 17:48, Dmitry Olshansky wrote: I'm using objconv by Agner Fog as I haven't got dumpobj (guess I'll buy it if need be). dumpobj is included in the DMD release, at least on Mac OS X. And Linux has it. Guess Windows sucks ... dumpbin
Re: Compilation strategy
12/19/2012 12:01 AM, Jacob Carlborg wrote: On 2012-12-18 17:48, Dmitry Olshansky wrote: I'm using objconv by Agner Fog as I haven't got dumpobj (guess I'll buy it if need be). dumpobj is included in the DMD release, at least on Mac OS X. And Linux has it. Guess Windows sucks ... -- Dmitry Olshansky
Re: Compilation strategy
On 2012-12-18 17:48, Dmitry Olshansky wrote: I'm using objconv by Agner Fog as I haven't got dumpobj (guess I'll buy it if need be). dumpobj is included in the DMD release, at least on Mac OS X. -- /Jacob Carlborg
Re: Compilation strategy
12/18/2012 9:30 PM, Walter Bright wrote: On 12/18/2012 8:57 AM, Dmitry Olshansky wrote: But an adequate bytecode designed for interpreters (see e.g. Lua) is designed for faster execution. The way CTFE is done now* is a polymorphic call per AST node that does a lot of analysis that could be decided once and stored in ... *ehm* ... IR. Currently it's also somewhat mixed with semantic analysis (thus raising the complexity). The architectural failings of CTFE are primarily my fault, from taking an implementation shortcut and building it out of enhancing the constant folding code. They are not a demonstration of the inherent superiority of one scheme or another. Nor do CTFE's problems indicate that modules should be represented as bytecode externally. Agreed. It seemed to me that since CTFE implements an interpreter for D, it would be useful to define a flattened representation of the semantically analyzed AST that is tailored for execution. The same bytecode then could be used for external representation. There is however the problem of templates, which can only be analyzed on instantiation. So indeed we can't fully "precompile" the semantic step into bytecode, meaning that it won't be much beyond a flattened result of the parse step. On second thought it may not be that useful after all. Another point is that pointer-chasing data structures are not a recipe for fast repeated execution. To provide an analogy: executing a calculation recursively on the AST tree of an expression is bound to be slower than running the same calculation straight on a sanely encoded flat reverse-polish notation. A hit below the belt: also peek at your own DMDScript - why bother with a plain IR (_bytecode_!) for JavaScript if it could just as well be interpreted as-is on ASTs? Give me some credit for learning something over the last 12 years! I'm not at all convinced I'd use the same design if I were doing it now. 
OK ;) If I was doing it, and speed was paramount, I'd probably fix it to generate native code instead of bytecode and so execute code directly. Even simple JITs dramatically sped up the early Java VMs. Granted, a JIT is faster, but I'm personally more interested in portable interpreters. I've been digging around and gathering techniques, and so far it looks rather promising. Though I need more field testing... and computed gotos in D! Or, more specifically, a way to _force_ a tail call. -- Dmitry Olshansky
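The portable-interpreter technique alluded to here can be sketched with plain table dispatch; this Python toy (opcode names invented) stands in for what computed gotos or forced tail calls would buy in D, namely one cheap indirect jump per instruction instead of a big switch over node kinds:

```python
# Minimal table-dispatched stack VM. Each opcode maps to a handler;
# the main loop is a single indirect call per instruction.
def op_push(vm, arg): vm["stack"].append(arg)
def op_add(vm, _):
    s = vm["stack"]; b, a = s.pop(), s.pop(); s.append(a + b)
def op_mul(vm, _):
    s = vm["stack"]; b, a = s.pop(), s.pop(); s.append(a * b)

HANDLERS = {"push": op_push, "add": op_add, "mul": op_mul}

def run(program):
    vm = {"stack": []}
    for opcode, arg in program:      # the dispatch loop
        HANDLERS[opcode](vm, arg)    # one indirect call per instruction
    return vm["stack"][-1]

# 2 * 3 + 1
result = run([("push", 2), ("push", 3), ("mul", None),
              ("push", 1), ("add", None)])
```

With computed gotos (or guaranteed tail calls), each handler would jump straight to the next handler instead of returning to a central loop, which is where interpreters like Lua's get much of their speed.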
Re: Compilation strategy
On Tue, Dec 18, 2012 at 10:06:43AM -0800, Walter Bright wrote: > On 12/18/2012 9:49 AM, H. S. Teoh wrote: > >Is it too late to change CTFE to work via native code? > > No, because doing so involves zero language changes. It is purely a > quality-of-implementation issue. Well, that much is obvious; I was just wondering if the current implementation will require too much effort to make it work with native code CTFE. > >Besides the effort required to rework the existing code (and perhaps > >the cross-compiling issue, though I don't see it as a major issue), > > Um, it does introduce major support costs for porting to different CPU > targets. [...] Could you elaborate? In my mind, there's not much additional cost to what's already involved in targeting a particular CPU in the first place. Since the CPU is already targeted, we generate native code for it and run that during CTFE. Or are you referring to cross-compiling? T -- Tech-savvy: euphemism for nerdy.
Re: Compilation strategy
On Tue, Dec 18, 2012 at 09:55:57AM -0800, Walter Bright wrote: > On 12/18/2012 9:42 AM, H. S. Teoh wrote: > >I was thinking more along the lines of things like fully automatic > >purity, safety, exception inference. For example, every function body > >eventually has to be processed by the compiler, so if a particular > >function is inferred to throw exception X, for example, then when its > >callers are compiled, this fact can be propagated to them. To do this > >for the whole program might be infeasible due to the sheer size of > >things, but if a module contains, for each function exposed by the > >API, a list of all thrown exceptions, then when the module is > >imported this information is available up-front and can be propagated > >further up the call chain. Same thing goes with purity and @safe. > > > >This may even allow us to make pure/@safe/nothrow fully automated so > >that you don't have to explicitly state them (except when you want > >the compiler to verify that what you wrote is actually pure, safe, > >etc.). > > The trouble with this is the separate compilation model. If the > attributes are not in the function signature, then the function > implementation can change without recompiling the user of that > function. Changing the inferred attributes then will subtly break > your build. And here's a reason for using an intermediate format (whether it's bytecode or just plain serialized AST or something else, is irrelevant). Say we put the precompiled module in a zip file of some sort. If the function attributes change, so does the zip file. So if proper make dependencies are set up, this will automatically trigger the recompilation of whoever uses the module. > Inferred attributes only work when the implementation source is > guaranteed to be available, such as with template functions. > > Having a binary format doesn't change this. Actually, this doesn't depend on the format being binary. 
You can save everything in plain text format and it will still work. In fact, there might be reasons to want a text format instead of binary, since then one could look at the compiler output to find out what the inferred attributes of a particular declaration are without needing to add compiler querying features. T -- Why have vacation when you can work?? -- EC
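The propagation idea in this exchange can be sketched as a fixed-point pass over a call graph; the function names and thrown-exception sets below are invented for illustration, and a real compiler would record this per module on declarations rather than on a toy dict:

```python
# Toy inference of thrown-exception sets: each function's set is its own
# direct throws plus everything its callees can throw, iterated to a
# fixed point. This mimics propagating inferred attributes up the call chain.
calls = {
    "readFile": [],                 # leaf; throws directly
    "compute":  [],                 # leaf; throws nothing
    "parse":    ["readFile"],
    "main":     ["parse", "compute"],
}
direct_throws = {
    "readFile": {"FileException"},
}

def infer_throws(calls, direct_throws):
    throws = {f: set(direct_throws.get(f, ())) for f in calls}
    changed = True
    while changed:                  # iterate until nothing new propagates
        changed = False
        for f, callees in calls.items():
            for c in callees:
                new = throws[c] - throws[f]
                if new:
                    throws[f] |= new
                    changed = True
    return throws

throws = infer_throws(calls, direct_throws)
```

The separate-compilation objection above maps onto this sketch directly: if readFile's body changes its direct throws, every set computed from it changes too, so callers compiled against the old sets silently go stale.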
Re: Compilation strategy
On 12/18/2012 9:49 AM, H. S. Teoh wrote: Is it too late to change CTFE to work via native code? No, because doing so involves zero language changes. It is purely a quality-of-implementation issue. Besides the effort required to rework the existing code (and perhaps the cross-compiling issue, though I don't see it as a major issue), Um, it does introduce major support costs for porting to different CPU targets.
Re: Compilation strategy
On 12/18/2012 9:42 AM, H. S. Teoh wrote: I was thinking more along the lines of things like fully automatic purity, safety, exception inference. For example, every function body eventually has to be processed by the compiler, so if a particular function is inferred to throw exception X, for example, then when its callers are compiled, this fact can be propagated to them. To do this for the whole program might be infeasible due to the sheer size of things, but if a module contains, for each function exposed by the API, a list of all thrown exceptions, then when the module is imported this information is available up-front and can be propagated further up the call chain. Same thing goes with purity and @safe. This may even allow us to make pure/@safe/nothrow fully automated so that you don't have to explicitly state them (except when you want the compiler to verify that what you wrote is actually pure, safe, etc.). The trouble with this is the separate compilation model. If the attributes are not in the function signature, then the function implementation can change without recompiling the user of that function. Changing the inferred attributes then will subtly break your build. Inferred attributes only work when the implementation source is guaranteed to be available, such as with template functions. Having a binary format doesn't change this.
Re: Compilation strategy
On Tue, Dec 18, 2012 at 09:30:40AM -0800, Walter Bright wrote: > On 12/18/2012 8:57 AM, Dmitry Olshansky wrote: [...] > >Another point is that pointer chasing data-structures is not a recipe > >for fast repeated execution. > > > >To provide an analogy: executing calculation recursively on AST tree > >of expression is bound to be slower than running the same calculation > >straight on sanely encoded flat reverse-polish notation. > > > >A hit below belt: also peek at your own DMDScript - why bother with > >plain IR (_bytecode_!) for JavaScript if it could just fine be > >interpreted as is on AST-s? > > Give me some credit for learning something over the last 12 years! > I'm not at all convinced I'd use the same design if I were doing it > now. > > If I was doing it, and speed was paramount, I'd probably fix it to > generate native code instead of bytecode and so execute code > directly. Even simple JITs dramatically sped up the early Java > VMs. [...] Is it too late to change CTFE to work via native code? Besides the effort required to rework the existing code (and perhaps the cross-compiling issue, though I don't see it as a major issue), I see a lot of advantages to doing that. For one thing, it will solve the current complaints about CTFE speed and memory usage (a native code implementation would allow using a GC to keep memory footprint down, or perhaps just a sandbox that can be ditched after evaluation and its memory reclaimed). T -- Obviously, some things aren't very obvious.
Re: Compilation strategy
On Tue, Dec 18, 2012 at 08:12:51AM -0800, Walter Bright wrote: > On 12/18/2012 7:51 AM, H. S. Teoh wrote: > >An idea occurred to me while reading this. What if, when compiling a > >module, say, the compiler not only emits object code, but also > >information like which functions are implied to be strongly pure, > >weakly pure, @safe, etc., as well as some kind of symbol dependency > >information. Basically, any derived information that isn't > >immediately obvious from the code is saved. > > > >Then when importing the module, the compiler doesn't have to > >re-derive all of this information, but it is immediately available. > > > >One can also include information like whether a function actually > >throws an exception (regardless of whether it's marked nothrow), > >which exception(s) it throws, etc.. This may open up the possibility > >of doing some things with the language that are currently infeasible, > >regardless of the obfuscation issue. > > This is a binary import. It offers negligible advantages over .di > files. I was thinking more along the lines of things like fully automatic purity, safety, exception inference. For example, every function body eventually has to be processed by the compiler, so if a particular function is inferred to throw exception X, for example, then when its callers are compiled, this fact can be propagated to them. To do this for the whole program might be infeasible due to the sheer size of things, but if a module contains, for each function exposed by the API, a list of all thrown exceptions, then when the module is imported this information is available up-front and can be propagated further up the call chain. Same thing goes with purity and @safe. This may even allow us to make pure/@safe/nothrow fully automated so that you don't have to explicitly state them (except when you want the compiler to verify that what you wrote is actually pure, safe, etc.). T -- The slower you go, the further you'll get.
Re: Compilation strategy
On 12/18/2012 8:57 AM, Dmitry Olshansky wrote: But an adequate bytecode designed for interpreters (see e.g. Lua) is designed for faster execution. The way CTFE is done now* is a polymorphic call per AST node that does a lot of analysis that could be decided once and stored in ... *ehm* ... IR. Currently it's also somewhat mixed with semantic analysis (thus raising the complexity). The architectural failings of CTFE are primarily my fault, from taking an implementation shortcut and building it out of enhancing the constant folding code. They are not a demonstration of the inherent superiority of one scheme or another. Nor do CTFE's problems indicate that modules should be represented as bytecode externally. Another point is that pointer-chasing data structures are not a recipe for fast repeated execution. To provide an analogy: executing a calculation recursively on the AST tree of an expression is bound to be slower than running the same calculation straight on a sanely encoded flat reverse-polish notation. A hit below the belt: also peek at your own DMDScript - why bother with a plain IR (_bytecode_!) for JavaScript if it could just as well be interpreted as-is on ASTs? Give me some credit for learning something over the last 12 years! I'm not at all convinced I'd use the same design if I were doing it now. If I was doing it, and speed was paramount, I'd probably fix it to generate native code instead of bytecode and so execute code directly. Even simple JITs dramatically sped up the early Java VMs.
Re: Compilation strategy
On 12/18/2012 8:54 AM, Andrei Alexandrescu wrote: On 12/18/12 10:01 AM, Walter Bright wrote: On 12/18/2012 1:43 AM, Dmitry Olshansky wrote: Compared to doing computations on AST trees (and looking up every name in the symbol table?), creating fake nodes when the result is computed, etc.? CTFE does not look up every (or any) name in the symbol table. I don't see any advantage to interpreting bytecode over interpreting ASTs. In fact, all the Java bytecode is, is a serialized AST. My understanding is that Java bytecode is somewhat lowered, e.g. using a stack machine for arithmetic, jumps, etc., which makes it more amenable to interpretation than what an AST walker would do. The Java bytecode is indeed a stack machine, and a stack machine *is* a serialized AST. Also, bytecode is more directly streamable because you don't need any pointer fixups. A stack machine *is* a streamable representation of an AST. They are trivially convertible back and forth between each other, and I mean trivially.
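The "trivially convertible" claim can be made concrete with a toy sketch (node shapes invented for illustration, Python rather than D): flattening an expression tree to postfix order yields a stack-machine "bytecode", and replaying that code against a stack rebuilds the identical tree:

```python
# Nodes are ("num", n) leaves or (op, left, right) for a binary op.
def to_postfix(node):
    """Serialize an AST to flat postfix (stack-machine) code."""
    if node[0] == "num":
        return [node]
    op, l, r = node
    return to_postfix(l) + to_postfix(r) + [("op", op)]

def to_ast(code):
    """Rebuild the AST by replaying the postfix code against a stack."""
    stack = []
    for instr in code:
        if instr[0] == "num":
            stack.append(instr)
        else:                      # binary op: pop two operands
            r, l = stack.pop(), stack.pop()
            stack.append((instr[1], l, r))
    return stack.pop()

tree = ("+", ("num", 1), ("*", ("num", 2), ("num", 3)))
code = to_postfix(tree)
assert to_ast(code) == tree        # round trip: both directions, trivially
```

The `code` list is also what makes the streamability point: it is a flat sequence with no pointers to fix up, unlike the nested tuples of `tree`.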
Re: Compilation strategy
On 12/18/2012 8:48 AM, Dmitry Olshansky wrote: After dropping debug info I can't make heads or tails of what's in the exe yet, but it _seems_ not to include all of the unused code. Gotta investigate on a smaller sample. Generate a linker .map file (-map to dmd). That'll tell you what's in it.
Re: Compilation strategy
12/18/2012 7:01 PM, Walter Bright wrote: On 12/18/2012 1:43 AM, Dmitry Olshansky wrote: Compared to doing computations on AST trees (and looking up every name in the symbol table?), creating fake nodes when the result is computed, etc.? CTFE does not look up every (or any) name in the symbol table. I stand corrected - ditch "the looking up every name in symbol table" part. Honestly, I deduced that from your statement: >>>the type information and AST trees and symbol table. Note the symbol table. Looking inside, I cannot immediately grasp whether it ever uses it. I see that e.g. variables are tied to nodes that represent declarations, and values to expression nodes of the already processed AST. I don't see any advantage to interpreting bytecode over interpreting ASTs. In fact, all the Java bytecode is, is a serialized AST. We need no stinkin' Java ;) But an adequate bytecode designed for interpreters (see e.g. Lua) is designed for faster execution. The way CTFE is done now* is a polymorphic call per AST node that does a lot of analysis that could be decided once and stored in ... *ehm* ... IR. Currently it's also somewhat mixed with semantic analysis (thus raising the complexity). Another point is that pointer-chasing data structures are not a recipe for fast repeated execution. To provide an analogy: executing a calculation recursively on the AST tree of an expression is bound to be slower than running the same calculation straight on a sanely encoded flat reverse-polish notation. A hit below the belt: also peek at your own DMDScript - why bother with a plain IR (_bytecode_!) for JavaScript if it could just as well be interpreted as-is on ASTs? *I judge by a cursory look at the source and bits that Don sometimes shares about it. -- Dmitry Olshansky
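The reverse-polish analogy can be seen in a small sketch (Python rather than D, with a made-up node encoding): the recursive walk dispatches per node and chases pointers, while the flat form is a single loop over an array; both compute the same value:

```python
# Recursive AST walk: one function call and one dispatch per node.
def eval_tree(node):
    if isinstance(node, int):
        return node
    op, l, r = node
    a, b = eval_tree(l), eval_tree(r)
    return a + b if op == "+" else a * b

# Flat reverse-polish walk: one tight loop over a flat token list.
def eval_rpn(code):
    stack = []
    for tok in code:
        if isinstance(tok, int):
            stack.append(tok)
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if tok == "+" else a * b)
    return stack[0]

tree = ("+", 1, ("*", 2, 3))   # 1 + 2 * 3
rpn  = [1, 2, 3, "*", "+"]     # the same expression, flattened
assert eval_tree(tree) == eval_rpn(rpn)
```

The point of the analogy is the memory behavior, not Python timings: the flat list is contiguous and decision-free to traverse, whereas the tree walk follows a pointer per node.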
Re: Compilation strategy
On 12/18/12 10:01 AM, Walter Bright wrote: On 12/18/2012 1:43 AM, Dmitry Olshansky wrote: Compared to doing computations on AST trees (and looking up every name in the symbol table?), creating fake nodes when the result is computed, etc.? CTFE does not look up every (or any) name in the symbol table. I don't see any advantage to interpreting bytecode over interpreting ASTs. In fact, all the Java bytecode is, is a serialized AST. My understanding is that Java bytecode is somewhat lowered, e.g. using a stack machine for arithmetic, jumps, etc., which makes it more amenable to interpretation than what an AST walker would do. Also, bytecode is more directly streamable because you don't need any pointer fixups. In brief, I agree there's an isomorphism between bytecode and AST representation, but there are a few differences that may be important to certain applications. Andrei
Re: Compilation strategy
12/18/2012 6:51 PM, Walter Bright wrote: On 12/18/2012 1:33 AM, Dmitry Olshansky wrote: More than that - the end result is the same: to avoid carrying junk into an app you (or the compiler) still have to put each function in its own section. That's what COMDATs are. Okay.. Doing separate compilation I always (unless doing LTO or template-heavy code) see either the whole thing or nothing (D included). Most likely the compiler will do it for you only with a special switch. dmd emits COMDATs for all global functions. You can see this by running dumpobj on the output. Thanks for carrying on this Q. I'm using objconv by Agner Fog as I haven't got dumpobj (guess I'll buy it if need be). However, I see from the comments in the dumped asm that mark section boundaries that all functions are indeed in COMDAT sections. Still, linking these object files and disassembling the output, I see all of the functions are there intact. I added debug symbols to the build though - could that make optlink keep symbols? After dropping debug info I can't make heads or tails of what's in the exe yet, but it _seems_ not to include all of the unused code. Gotta investigate on a smaller sample. -- Dmitry Olshansky
Re: Compilation strategy
On 12/18/2012 7:51 AM, H. S. Teoh wrote: An idea occurred to me while reading this. What if, when compiling a module, say, the compiler not only emits object code, but also information like which functions are implied to be strongly pure, weakly pure, @safe, etc., as well as some kind of symbol dependency information. Basically, any derived information that isn't immediately obvious from the code is saved. Then when importing the module, the compiler doesn't have to re-derive all of this information, but it is immediately available. One can also include information like whether a function actually throws an exception (regardless of whether it's marked nothrow), which exception(s) it throws, etc.. This may open up the possibility of doing some things with the language that are currently infeasible, regardless of the obfuscation issue. This is a binary import. It offers negligible advantages over .di files.
Re: Compilation strategy
On Tue, Dec 18, 2012 at 07:01:28AM -0800, Walter Bright wrote: > On 12/18/2012 1:43 AM, Dmitry Olshansky wrote: > >Compared to doing computations on AST trees (and looking up every > >name in symbol table?), creating fake nodes when the result is > >computed etc? > > CTFE does not look up every (or any) name in the symbol table. I don't > see any advantage to interpreting bytecode over interpreting ASTs. In > fact, all the Java bytecode is, is a serialized AST. I've always thought that CTFE should run native code. Yes, I'm aware of the objections related to cross-compiling, etc., but honestly, how many people actually use a cross-compiler (or know what it is)? Interpreting CTFE to me should be a fallback, not the default mode of implementation. T -- Acid falls with the rain; with love comes the pain.
Re: Compilation strategy
On Tue, Dec 18, 2012 at 06:55:34AM -0800, Walter Bright wrote: > On 12/18/2012 3:43 AM, foobar wrote: > >Honest question - If D already has all the semantic info in COMDAT > >sections, > > It doesn't. COMDATs are object file sections. They do not contain > type info, for example. > > > * provide a byte-code solution to support the portability case. e.g > > Java byte-code or Google's pNaCL solution that relies on LLVM > > bit-code. > > There is no advantage to bytecodes. Putting them in a zip file does > not make them produce better results. [...] An idea occurred to me while reading this. What if, when compiling a module, say, the compiler not only emits object code, but also information like which functions are implied to be strongly pure, weakly pure, @safe, etc., as well as some kind of symbol dependency information. Basically, any derived information that isn't immediately obvious from the code is saved. Then when importing the module, the compiler doesn't have to re-derive all of this information, but it is immediately available. One can also include information like whether a function actually throws an exception (regardless of whether it's marked nothrow), which exception(s) it throws, etc.. This may open up the possibility of doing some things with the language that are currently infeasible, regardless of the obfuscation issue. T -- There are three kinds of people in the world: those who can count, and those who can't.
Re: Compilation strategy
On 12/18/2012 1:43 AM, Dmitry Olshansky wrote: Compared to doing computations on AST trees (and looking up every name in the symbol table?), creating fake nodes when the result is computed, etc.? CTFE does not look up every (or any) name in the symbol table. I don't see any advantage to interpreting bytecode over interpreting ASTs. In fact, all the Java bytecode is, is a serialized AST.
Re: Compilation strategy
On 12/18/2012 3:43 AM, foobar wrote: Honest question - If D already has all the semantic info in COMDAT sections, It doesn't. COMDATs are object file sections. They do not contain type info, for example. * provide a byte-code solution to support the portability case. e.g Java byte-code or Google's pNaCL solution that relies on LLVM bit-code. There is no advantage to bytecodes. Putting them in a zip file does not make them produce better results.
Re: Compilation strategy
On 12/18/2012 1:33 AM, Dmitry Olshansky wrote: More than that - the end result is the same: to avoid carrying junk into an app you (or the compiler) still have to put each function in its own section. That's what COMDATs are. Doing separate compilation I always (unless doing LTO or template-heavy code) see either the whole thing or nothing (D included). Most likely the compiler will do it for you only with a special switch. dmd emits COMDATs for all global functions. You can see this by running dumpobj on the output.
Re: Compilation strategy
On Tuesday, 18 December 2012 at 11:43:18 UTC, foobar wrote: On Monday, 17 December 2012 at 22:24:00 UTC, Walter Bright wrote: On 12/17/2012 2:08 PM, Dmitry Olshansky wrote: I really loved the way Turbo Pascal units were made. I wish D would go the same route. Object files would then be looked at as a minimal and stupid variation of a module where symbols are identified by mangling (not plain metadata as (would be) in a module) and no source for templates is emitted. +1 I'll bite. How is this superior to D's system? I have never used TP. *Semantic info on interdependency of symbols in a source file is destroyed right before the linker and thus each .obj file is included as a whole or not at all. Thus all C run-times I've seen _sidestep_ this by writing each function in its own file(!). Even this alone should have been a clear indication. This is done using COMDATs in C++ and D today. Honest question - If D already has all the semantic info in COMDAT sections, why do we still require additional auxiliary files? Surely, a single binary library (lib/so) should be enough to encapsulate a library without the need to re-parse the source files or additional header files? You yourself seem to agree that a single zip file is superior to what we currently have, and as an aside the entire Java community agrees with us - the Java Jar/War/etc formats are all renamed zip archives. Regarding the obfuscation and portability issues - the zip file can contain whatever we want. This means it should be possible to tailor the contents to support different use-cases: * provide fat libraries as on OSX - internally store multiple binaries for different architectures; those binary objects are very hard to decompile back to source code, thus answering the obfuscation need. * provide a byte-code solution to support the portability case, e.g. Java byte-code or Google's pNaCL solution that relies on LLVM bit-code.
Also, there are different work-flows that can be implemented - Java uses JIT to gain efficiency vs. .NET that supports install-time AOT compilation. It basically stores the native executable in a special cache. In Windows 8 RT, .NET binaries are actually compiled to native code when uploaded to the Windows App Store.
Re: Compilation strategy
On 12/18/12, foobar wrote: > Besides, the other compilers merge in the same front-end > code so they'll gain the same feature anyway. There's no gain in > separating it out to rdmd. Adding more front-end features adds more work for maintainers of compilers which are based on the DMD front-end, and not all compilers are based on the DMD front-end. Don't forget the huge gain of using D over C++ to implement the feature.
Re: Compilation strategy
On Tuesday, 18 December 2012 at 00:48:40 UTC, Walter Bright wrote: Wow, I think that's exactly what we could use! It serves multiple optional use cases all at once! Was there a technical reason for you not getting around to implementing it, or just a lack of time? There always seemed something more important to be doing, and Andrei thought it would be better to put such a capability in rdmd rather than dmd. This is inconsistent with D's design - providing useful features built-in (docs generator, testing, profiling, etc). Moreover, it breaks encapsulation. This means the compiler exposes an inferior format that will later be wrapped by a more capable packaging format, thus exposing the implementation details and adding an external dependency on that inferior format. Besides, the other compilers merge in the same front-end code so they'll gain the same feature anyway. There's no gain in separating it out to rdmd. The main question is whether you approve of the concept and are willing to put it on the to-do list. I'm sure that if you endorse this feature someone else will come in and implement it.
Re: Compilation strategy
On Monday, 17 December 2012 at 22:24:00 UTC, Walter Bright wrote: On 12/17/2012 2:08 PM, Dmitry Olshansky wrote: I really loved the way Turbo Pascal units were made. I wish D would go the same route. Object files would then be looked at as a minimal and stupid variation of a module where symbols are identified by mangling (not plain metadata as (would be) in a module) and no source for templates is emitted. +1 I'll bite. How is this superior to D's system? I have never used TP. *Semantic info on interdependency of symbols in a source file is destroyed right before the linker and thus each .obj file is included as a whole or not at all. Thus all C run-times I've seen _sidestep_ this by writing each function in its own file(!). Even this alone should have been a clear indication. This is done using COMDATs in C++ and D today. Honest question - If D already has all the semantic info in COMDAT sections, why do we still require additional auxiliary files? Surely, a single binary library (lib/so) should be enough to encapsulate a library without the need to re-parse the source files or additional header files? You yourself seem to agree that a single zip file is superior to what we currently have, and as an aside the entire Java community agrees with us - the Java Jar/War/etc formats are all renamed zip archives. Regarding the obfuscation and portability issues - the zip file can contain whatever we want. This means it should be possible to tailor the contents to support different use-cases: * provide fat libraries as on OSX - internally store multiple binaries for different architectures; those binary objects are very hard to decompile back to source code, thus answering the obfuscation need. * provide a byte-code solution to support the portability case, e.g. Java byte-code or Google's pNaCL solution that relies on LLVM bit-code. Also, there are different work-flows that can be implemented - Java uses JIT to gain efficiency vs. .NET that supports install-time AOT compilation.
It basically stores the native executable in a special cache.
Re: Compilation strategy
On Tuesday, 18 December 2012 at 00:15:04 UTC, H. S. Teoh wrote: On Tue, Dec 18, 2012 at 02:08:55AM +0400, Dmitry Olshansky wrote: [...] I suspect it's one of the prime examples where the UNIX philosophy of combining a bunch of simple (~ dumb) programs together in place of one more complex program was taken *far* beyond reasonable lengths. Having a pipeline: preprocessor -> compiler -> (still?) assembler -> linker where every program tries hard to know nothing about the previous ones (and to be as simple as it possibly can be) is bound to get inadequate results on many fronts: - efficiency & scalability - cross-border error reporting and detection (linker errors? errors for expanded macro magic?) - cross-file manipulations (e.g. optimization, see _how_ LTO is done in GCC) - multiple problems from a loss of information across the pipeline* The problem is not so much the structure preprocessor -> compiler -> assembler -> linker; the problem is that these logical stages have been arbitrarily assigned to individual processes residing in their own address space, communicating via files (or pipes, whatever it may be). The fact that they are separate processes is in itself not that big of a problem, but the fact that they reside in their own address space is a big problem, because you cannot pass any information down the chain except through rudimentary OS interfaces like files and pipes. Even that wouldn't have been so bad, if it weren't for the fact that user interface (in the form of text input / object file format) has also been conflated with program interface (the compiler has to produce the input to the assembler, in *text*, and the assembler has to produce object files that do not encode any direct dependency information because that's the standard file format the linker expects). Now consider if we keep the same stages, but each stage is not a separate program but a *library*.
The code then might look, in greatly simplified form, something like this:

    import libdmd.compiler;
    import libdmd.assembler;
    import libdmd.linker;

    void main(string[] args)
    {
        // typeof(asmCode) is some arbitrarily complex data
        // structure encoding assembly code, inter-module
        // dependencies, etc.
        auto asmCode = compiler.lex(args)
            .parse()
            .optimize()
            .codegen();

        // Note: no stupid redundant convert to string, parse,
        // convert back to internal representation.
        auto objectCode = assembler.assemble(asmCode);

        // Note: linker has direct access to dependency info,
        // etc., carried over from asmCode -> objectCode.
        auto executable = linker.link(objectCode);

        File output(outfile, "w");
        executable.generate(output);
    }

Note that the types asmCode, objectCode, executable, are arbitrarily complex, and may contain lazy-evaluated data structures, references to on-disk temporary storage (for large projects you can't hold everything in RAM), etc. Dependency information in asmCode is propagated to objectCode, as necessary. The linker has full access to all info the compiler has access to, and can perform inter-module optimization, etc., by accessing information available to the *compiler* front-end, not just some crippled object file format. The root of the current nonsense is that perfectly-fine data structures are arbitrarily required to be flattened into some kind of intermediate form, written to some file (or sent down some pipe), often with loss of information, then read from the other end, interpreted, and reconstituted into other data structures (with incomplete info), then processed. In many cases, information that didn't make it through the channel has to be reconstructed (often imperfectly), and then used. Most of these steps are redundant. If the compiler data structures were already directly available in the first place, none of this baroque dance is necessary.
*Semantic info on interdependency of symbols in a source file is destroyed right before the linker and thus each .obj file is included as a whole or not at all. Thus all C run-times I've seen _sidestep_ this by writing each function in its own file(!). Even this alone should have been a clear indication. While simplicity (and correspondingly size in memory) of programs was king in the '70s, that era is well past. Nowadays I think it's all about getting the highest throughput and more powerful features. [...] Simplicity is good. Simplicity lets you modularize a very complex piece of software (a compiler that converts D source code into executables) into manageable chunks. Simplicity does not require shoe-horning modules into separate programs with separate address spaces with separate (and deficient) input/output formats. The problem is
Re: Compilation strategy
12/18/2012 4:42 AM, Walter Bright writes: On 12/17/2012 3:03 PM, deadalnix wrote: I know that. I'm not arguing against that. I'm arguing against the fact that this is a blocker. This is a blocker in very few use cases, in fact. I just look at the whole picture here. People needing that are the exception, not the rule. I'm not sure what you mean. A blocker for what? And what prevents us from using a bytecode that loses information? I'd turn that around and ask why have a bytecode? As long as it is CTFEable, most people will be happy. CTFE needs the type information and AST trees and symbol table. Everything needed for decompilation. The fact that CTFE has to crawl AST trees is AFAIK a mere happenstance. It helps nothing except the way CTFE was hacked into the current compiler structure. There should be a far more suitable IR (if you don't like the bytecode term) if we are to run CTFE at speeds even marginally comparable to run-time. I know that bytecode has been around since 1995 in its current incarnation, and there's an ingrained assumption that since there's such an extensive ecosystem around it, there must be some advantage to it. I don't care for ecosystems. And there is none involved in the argument. But there isn't. Compared to doing computations on AST trees (and looking up every name in the symbol table?), creating fake nodes when the result is computed, etc.? I'm out of words. -- Dmitry Olshansky
Re: Compilation strategy
12/18/2012 2:23 AM, Walter Bright writes: On 12/17/2012 2:08 PM, Dmitry Olshansky wrote: I really loved the way Turbo Pascal units were made. I wish D would go the same route. Object files would then be looked at as a minimal and stupid variation of a module where symbols are identified by mangling (not plain metadata as (would be) in a module) and no source for templates is emitted. +1 I'll bite. How is this superior to D's system? I have never used TP. One superiority is having a compiled module with its public interface (a-la .di but in some binary format) in one file. Along with the public interface it retains dependency information. Basically, things that describe one entity should not be separated. I can say that the advantage of "grab this single file and you are good to go" should not be underestimated. Thus there is no mess with header files out of date and/or object files that fail to link because of that. Now, back then there were no templates nor CTFE, so the module structure was simple. There were no packages either (they landed in Delphi). I'd expect D to have a format built around modules and packages of these. Then pre-compiled libraries are commonly distributed as a package. The upside of having our own special format is being able to tailor it for our needs, e.g. store type info & metadata plainly (no mangle-demangle), have separately compiled (and checked) pure functions, better cross-symbol dependency info, etc. To link with C we could still compile all of the D modules into a huge object file (split into a monstrous number of sections). *Semantic info on interdependency of symbols in a source file is destroyed right before the linker and thus each .obj file is included as a whole or not at all. Thus all C run-times I've seen _sidestep_ this by writing each function in its own file(!). Even this alone should have been a clear indication. This is done using COMDATs in C++ and D today. Well, that's terse.
Either way it looks like a workaround for templates that during separate compilation dump identical code into the objs, to auto-merge these. More than that - the end result is the same: to avoid carrying junk into an app you (or the compiler) still have to put each function in its own section. Doing separate compilation I always (unless doing LTO or template-heavy code) see either whole or nothing (D included). Most likely the compiler will do it for you only with a special switch. This raises another question - why not eliminate junk by default? P.S. Looking at M$: http://msdn.microsoft.com/en-us/library/xsa71f43.aspx it needs two switches - one for the linker, one for the compiler. Hilarious. -- Dmitry Olshansky
Re: Compilation strategy
On Tuesday, 18 December 2012 at 07:48:01 UTC, Jacob Carlborg wrote: On 2012-12-17 23:12, Walter Bright wrote: I have toyed with the idea many times, however, of having dmd support zip files. Zip files can contain an arbitrary file hierarchy, with individual files in compressed, encrypted, or plaintext at the selection of the zip builder. An entire project, or library, or collection of source modules can be distributed as a zip file, and could be compiled with nothing more than: I think that a package manager should handle this. Example: https://github.com/jacob-carlborg/orbit/wiki/Orbit-Package-Manager-for-D Yes I can change the orbfiles to be written in D. Hehe. :) I even checked out your code with that idea in mind, but other things keep having higher priority. -- Paulo
Re: Compilation strategy
On 12/17/2012 11:40 PM, Jacob Carlborg wrote: On 2012-12-17 00:09, Walter Bright wrote: Figure out the cases where it happens and fix those cases. How is it supposed to work? Could there be some issue with the dependency tracker that should otherwise have indicated that more modules should have been recompiled? It should only generate function bodies that are needed, not all of them.
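Presumably "only the bodies that are needed" refers to template instantiation being demand-driven — a hypothetical illustration:

```d
// Only instantiations that are actually used should get code generated.
T square(T)(T x) { return x * x; }

void main()
{
    auto a = square(3); // emits code for square!int only;
    // square!double, square!float, etc. never appear in the object
    // file, because nothing instantiates them.
}
```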
Re: Compilation strategy
On 2012-12-17 23:12, Walter Bright wrote: I have toyed with the idea many times, however, of having dmd support zip files. Zip files can contain an arbitrary file hierarchy, with individual files in compressed, encrypted, or plaintext at the selection of the zip builder. An entire project, or library, or collection of source modules can be distributed as a zip file, and could be compiled with nothing more than: I think that a package manager should handle this. Example: https://github.com/jacob-carlborg/orbit/wiki/Orbit-Package-Manager-for-D Yes I can change the orbfiles to be written in D. -- /Jacob Carlborg
Re: Compilation strategy
On 2012-12-17 00:09, Walter Bright wrote: Figure out the cases where it happens and fix those cases. How is it supposed to work? Could there be some issue with the dependency tracker that should otherwise have indicated that more modules should have been recompiled? -- /Jacob Carlborg
Re: Compilation strategy
On 2012-12-18 01:13, H. S. Teoh wrote: The problem is not so much the structure preprocessor -> compiler -> assembler -> linker; the problem is that these logical stages have been arbitrarily assigned to individual processes residing in their own address space, communicating via files (or pipes, whatever it may be). The fact that they are separate processes is in itself not that big of a problem, but the fact that they reside in their own address space is a big problem, because you cannot pass any information down the chain except through rudimentary OS interfaces like files and pipes. Even that wouldn't have been so bad, if it weren't for the fact that user interface (in the form of text input / object file format) has also been conflated with program interface (the compiler has to produce the input to the assembler, in *text*, and the assembler has to produce object files that do not encode any direct dependency information because that's the standard file format the linker expects). Now consider if we keep the same stages, but each stage is not a separate program but a *library*. The code then might look, in greatly simplified form, something like this:

    import libdmd.compiler;
    import libdmd.assembler;
    import libdmd.linker;

    void main(string[] args)
    {
        // typeof(asmCode) is some arbitrarily complex data
        // structure encoding assembly code, inter-module
        // dependencies, etc.
        auto asmCode = compiler.lex(args)
            .parse()
            .optimize()
            .codegen();

        // Note: no stupid redundant convert to string, parse,
        // convert back to internal representation.
        auto objectCode = assembler.assemble(asmCode);

        // Note: linker has direct access to dependency info,
        // etc., carried over from asmCode -> objectCode.
        auto executable = linker.link(objectCode);

        File output(outfile, "w");
        executable.generate(output);
    }

Note that the types asmCode, objectCode, executable, are arbitrarily complex, and may contain lazy-evaluated data structures, references to on-disk temporary storage (for large projects you can't hold everything in RAM), etc. Dependency information in asmCode is propagated to objectCode, as necessary. The linker has full access to all info the compiler has access to, and can perform inter-module optimization, etc., by accessing information available to the *compiler* front-end, not just some crippled object file format. The root of the current nonsense is that perfectly-fine data structures are arbitrarily required to be flattened into some kind of intermediate form, written to some file (or sent down some pipe), often with loss of information, then read from the other end, interpreted, and reconstituted into other data structures (with incomplete info), then processed. In many cases, information that didn't make it through the channel has to be reconstructed (often imperfectly), and then used. Most of these steps are redundant. If the compiler data structures were already directly available in the first place, none of this baroque dance is necessary. I couldn't agree more. -- /Jacob Carlborg
Re: Compilation strategy
On 12/17/12 6:11 PM, Rob T wrote: On Monday, 17 December 2012 at 22:12:01 UTC, Walter Bright wrote:

    dmd xx foo.zip

is equivalent to:

    unzip foo
    dmd xx a.d b/c.d d.obj

P.S. I've also wanted to use .zip files as the .lib file format (!), as the various .lib formats have nothing over .zip files. Wow, I think that's exactly what we could use! It serves multiple optional use cases all at once! Was there a technical reason for you not getting around to implementing it, or just a lack of time? The latter. I wanted to do it in rdmd for ages. Andrei
Re: Compilation strategy
On Tuesday, 18 December 2012 at 02:48:05 UTC, Walter Bright wrote: Using standard zip tools is a big plus. Yes, but why limit yourself in this way? The easy answer is to look at the problems stemming from dmd using an object file format on Win32 that nobody else uses. I definitely didn't say dmd should use a format that no one else uses. In fact I actually said the opposite. Dmd should not be limiting the choices people want to make for themselves. In other words, it should allow people to use whatever formats they wish to use, and not a format imposed on them, such as was pointed out by yourself with the Win32 object file format. When I look at what dmd is, it's monolithic and very unfriendly to extensibility and re-use, so I was suggesting that you look at ways to free dmd from the current set of restraints that are holding back its use and adoption. My suggestion was that the compiler should be restructured into a re-usable, modularized system with built-in user extensibility features. Doing that would be a massive improvement and a very big step forward. --rt
Re: Compilation strategy
On Mon, 17 Dec 2012 00:19:51 -0800, deadalnix wrote: On Monday, 17 December 2012 at 08:02:12 UTC, Adam Wilson wrote: With respect to those who hold one ideology above others, trying to impose those ideals on another is a great way to ensure animosity. What a business does with their code is entirely up to them, and I would guess that even Richard Stallman himself would take issue with trying to impose an ideology on another person. What does that mean for D practically? Using a close-to-home example, imagine if Remedy decided that shipping their ENTIRE codebase in .DI files with the product would cause them to give away some new rendering trick that they came up with that nobody else had. And they decided that this was unacceptable. What would they most likely do? Rewrite the project in C++ and tell the D community to kindly pound sand. A license agreement is not enough to stop a thief. And once the new trick makes it into the wild, as long as a competitor can honestly say they had no idea how they got it (and they probably really don't, as they saw it on a legitimate game development website) the hands of the legal system are tied. But that's what I say! I can't stop myself laughing at people who think any business can be based on Java, PHP or C#. That is a mere dream! Such technology will simply never get used in companies, because bytecode can be decoded! I use C# every day at my job. We have expensive obfuscators to protect the bytecode. Even then it isn't perfect. But it's good enough. With the metadata model of .NET this isn't a problem for public APIs; just tell the obfuscator to ignore everything marked 'public'. However, with DIs being a copy of the plaintext source and NOT bytecode, you run the risk of changing the meaning of the program in unintended ways.
You see, in CIL (.NET bytecode) there are no autos (vars) or templates (generics); the C# compiler does the work of figuring out what the auto type really should be, or what the templates really are, BEFORE it writes out the IL, so later when the obfuscator does its job, there are no templates or autos for it to deal with. In D, we don't have this option: you either have plaintext, or you have binary code; there is no intermediate step like CIL. Hence we can't use the obfuscation approach. -- Adam Wilson IRC: LightBender Project Coordinator The Horizon Project http://www.thehorizonproject.org/
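The contrast Adam describes shows up in D's .di interface files: a non-template function can be reduced to a declaration, but a template must keep its full source body, because each importing module instantiates it anew (module and function names below are made up for illustration):

```d
// mylib.di — hypothetical interface file.
module mylib;

int frob(int x); // non-template: the body can live in the binary library

T clamp(T)(T v, T lo, T hi) // template: full source must ship, since
{                           // instantiation happens at the import site
    return v < lo ? lo : v > hi ? hi : v;
}
```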
Re: Compilation strategy
On Tuesday, 18 December 2012 at 02:48:05 UTC, Walter Bright wrote: Using standard zip tools is a big plus. Yes, but why limit yourself in this way? The easy answer is to look at the problems stemming from dmd using an object file format on Win32 that nobody else uses. I didn't say dmd should use a format that no one uses. What I did say is that you should not be limiting the choices people want to make for themselves. The current approach is a self-limiting one that is unable to make effective use of the resourcefulness of the D community. DMD may be open source, but it's a monolithic system that is very unfriendly to extensibility and re-use. Look at this thread of discussion; it is caused by the inability to make effective use of the compiler in ways that should be of absolutely no concern to you. You should be looking at ways of enabling users to be as free and creative as they can be, and that means you must let them make their own mistakes, not the opposite. --rt
Re: Compilation strategy
On 12/17/2012 6:40 PM, Rob T wrote: On Tuesday, 18 December 2012 at 02:17:25 UTC, Walter Bright wrote: On 12/17/2012 6:13 PM, Rob T wrote: Your suggestion concerning the use of zip files is a good idea, although you mention the encryption algo is very weak, but is there any reason to use a weak encryption algo, and is there even a reason to bother maintaining compatibility with the common zip format? Using standard zip tools is a big plus. Yes, but why limit yourself in this way? The easy answer is to look at the problems stemming from dmd using an object file format on Win32 that nobody else uses.
Re: Compilation strategy
On Tuesday, 18 December 2012 at 02:17:25 UTC, Walter Bright wrote: On 12/17/2012 6:13 PM, Rob T wrote: Your suggestion concerning the use of zip files is a good idea, although you mention the encryption algo is very weak, but is there any reason to use a weak encryption algo, and is there even a reason to bother maintaining compatibility with the common zip format? Using standard zip tools is a big plus. Yes, but why limit yourself in this way? I suppose you could provide a choice between different formats, but that's the wrong approach. The compiler should instead be restructured to allow D users to supply their own functionality in the form of user defined plugins, that way you won't have to bother second guessing what people need or don't need, or provide generic one size fits all solutions that no one likes, and you'll gain an army of coders who'll take D into very surprising directions that no one could possibly predict. Another nice fix would be to separate the CTFE interpreter out of the compiler as a loadable library so it can be used outside of the compiler for embedded D scripting, and possibly even for JIT applications. I expect there are a few more significant improvements that could be made simply by making the compiler less monolithic and more modularized. Easier said than done, but it should be done at some point because the advantages are very significant. --rt
Re: Compilation strategy
3) Performance can be improved to (near) native speeds with a JIT compiler. But then you might as well go native to begin with. Why wait till runtime to do compilation, when it can be done beforehand? The point though is that with a JIT, you can transmit source code (or byte code, which is smaller in size) over a wire and have it execute natively on a client machine. You cannot do that with native machine code because the client machine is always an unknown target. But then again, even if we never do this, it makes no difference to *me* -- the current situation is good enough for *me*. The question is whether or not we want D to be better received by enterprises. Exactly, the *we* part of all this doesn't matter in the slightest; it's what the end user wants that matters. If many potential D users want to hide their code (even if it's trivially hidden), but D won't let them, then they won't use D. It's a very simple equation, but holding on to idealisms will often get in the way of good sense. We already had one corporate user complain in here about the issue, and for everyone who complains there are dozens more who will say nothing at all and just walk away. --rt
Re: Compilation strategy
On 12/17/2012 6:13 PM, Rob T wrote: Your suggestion concerning the use of zip files is a good idea, although you mention the encryption algo is very weak, but is there any reason to use a weak encryption algo, and is there even a reason to bother maintaining compatibility with the common zip format? Using standard zip tools is a big plus.
Re: Compilation strategy
On Tuesday, 18 December 2012 at 01:52:21 UTC, Walter Bright wrote: If we implement a way of "hiding" implementation details that *allows* CTFE and templates (and thus one up the C++ situation), this will create a stronger incentive for D adoption. It doesn't matter if it's not hard to "unhide" the implementation; Yes, it does, because we would be lying if we were pretending this was an effective solution. If you can hide the implementation details for other reasons, then no such claim need be made at all; in fact you can explicitly warn people that the code is not really hidden should they think otherwise. Your suggestion concerning the use of zip files is a good idea, although you mention the encryption algo is very weak, but is there any reason to use a weak encryption algo, and is there even a reason to bother maintaining compatibility with the common zip format? I would expect that there are other compression formats that could be used. --rt
Re: Compilation strategy
On 12/17/2012 5:40 PM, Simen Kjaeraas wrote: .zip already has encryption, Just for the record, zip file "encryption" is trivially broken, and there are free downloadable tools to do that. About all it will do is keep your kid sister from reading your diary.
Re: Compilation strategy
On 12/17/2012 5:28 PM, H. S. Teoh wrote: Using PIMPL only helps if you're trying to hide implementation details of a struct or class. Anything that requires CTFE is out of the question. Templates are out of the question (this was also true with C++). This reduces the incentive to adopt D, since they might as well just stick with C++. We lose. I've never seen any closed-source companies reticent about using C++ because of obfuscation issues, which are the same as for D, so I do not see this as a problem. If we implement a way of "hiding" implementation details that *allows* CTFE and templates (and thus one up the C++ situation), this will create a stronger incentive for D adoption. It doesn't matter if it's not hard to "unhide" the implementation; Yes, it does, because we would be lying if we were pretending this was an effective solution. we don't lose anything (having no way to hide implementation is what we already have), plus it increases our chances of adoption -- esp. by enterprises, who are generally the kind of people who even care about this issue in the first place, and who are the people we *want* to attract. Sounds like a win to me. We'd lose credibility with them, as people will laugh at us over this. But then again, even if we never do this, it makes no difference to *me* -- the current situation is good enough for *me*. The question is whether or not we want D to be better received by enterprises. As I said, C++ is well received by enterprises. This is not an issue.
Re: Compilation strategy
On Tuesday, 18 December 2012 at 01:30:22 UTC, H. S. Teoh wrote: If we implement a way of "hiding" implementation details that *allows* CTFE and templates (and thus one up the C++ situation), this will create a stronger incentive for D adoption. It doesn't matter if it's not hard to "unhide" the implementation; we don't lose anything (having no way to hide implementation is what we already have), plus it increases our chances of adoption -- esp. by enterprises, who are generally the kind of people who even care about this issue in the first place, and who are the people we *want* to attract. Sounds like a win to me. I agree with that; involving the big guys is necessary to make the language live. If we don't, then D would become just another fan-loved language. It's really bad... well, that's the point.
Re: Compilation strategy
On 2012-12-18, 02:28, H. S. Teoh wrote: If we implement a way of "hiding" implementation details that *allows* CTFE and templates (and thus one up the C++ situation), this will create a stronger incentive for D adoption. It doesn't matter if it's not hard to "unhide" the implementation; we don't lose anything (having no way to hide implementation is what we already have), plus it increases our chances of adoption -- esp. by enterprises, who are generally the kind of people who even care about this issue in the first place, and who are the people we *want* to attract. Sounds like a win to me. .zip already has encryption, and unpacking those files and feeding them to the compiler should be a rather simple tool. Sure, if someone makes it, it could probably become part of the distribution. But making it a part of the compiler seems more than excessive. -- Simen
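Simen's "rather simple tool" can be sketched in a few lines. This is a hypothetical illustration in Python, not an existing utility; the archive name, the password, and the idea of handing the result to dmd are all assumptions (note that zipfile's setpwd only handles the legacy ZipCrypto scheme, which is weak, fitting the "illusion of security" theme discussed above):

```python
import tempfile
import zipfile
from pathlib import Path

def unpack_d_sources(archive, password=None):
    """Extract the .d files from `archive` into a temp directory and
    return their paths, ready to be handed to the compiler."""
    workdir = Path(tempfile.mkdtemp(prefix="dsrc-"))
    with zipfile.ZipFile(archive) as zf:
        if password is not None:
            zf.setpwd(password)  # legacy ZipCrypto passwords only
        members = [n for n in zf.namelist() if n.endswith(".d")]
        zf.extractall(workdir, members=members)
    return [str(workdir / m) for m in members]

# A wrapper tool would then run something like:
#   subprocess.run(["dmd"] + unpack_d_sources("lib.zip", b"secret"))
```

Shipping such a wrapper alongside the compiler (or as part of rdmd) would keep the compiler itself out of the encryption business.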
Re: Compilation strategy
On Mon, Dec 17, 2012 at 04:42:13PM -0800, Walter Bright wrote: > On 12/17/2012 3:03 PM, deadalnix wrote: [...] > >And what prevents us from using a bytecode that loses information? > > I'd turn that around and ask why have a bytecode? > > > >As long as it is CTFEable, most people will be happy. > > CTFE needs the type information and AST trees and symbol table. > Everything needed for decompilation. > > I know that bytecode has been around since 1995 in its current > incarnation, and there's an ingrained assumption that since there's > such an extensive ecosystem around it, that there is some advantage > to it. > > But there isn't. Now this, I have to agree with. The only advantage to bytecode is that if you have two interpreters on two different platforms, then bytecode on one can run verbatim on the other. But: 1) Bytecode is slower than native code, and always will be. 2) Unless, of course, you're running a machine that runs the bytecode directly. But that just means your code is native to that machine, and the interpreters on other machines are emulators. So you're already using native code anyway. And since you're already at it, might as well just use native code on the other machines, too. 3) Performance can be improved to (near) native speeds with a JIT compiler. But then you might as well go native to begin with. Why wait till runtime to do compilation, when it can be done beforehand? 4) Bytecode cannot be (easily) linked with native libraries. Various wrappers and other workarounds are necessary. The bytecode/native boundary is often inefficient, because generally there's need of translation between bytecode interpreter data types and native data types. 5) There are other issues, but I can't be bothered to think of them right now. But anyway, this is getting a bit off-topic. The original issue was separate compilation, and .di files.
Just for the record, I'd like to state that I am *not* convinced about the need to obfuscate library code (either by using .di or by other means), primarily because it's futile, but also because I believe in open source code. However, I know a LOT of employers and enterprises are NOT comfortable with the idea, and would not so much as consider a particular language/toolchain if they can't at least have the illusion of security. You may say it's silly, and I'd agree, but that does nothing to help adoption. Using PIMPL only helps if you're trying to hide implementation details of a struct or class. Anything that requires CTFE is out of the question. Templates are out of the question (this was also true with C++). This reduces the incentive to adopt D, since they might as well just stick with C++. We lose. If we implement a way of "hiding" implementation details that *allows* CTFE and templates (and thus one up the C++ situation), this will create a stronger incentive for D adoption. It doesn't matter if it's not hard to "unhide" the implementation; we don't lose anything (having no way to hide implementation is what we already have), plus it increases our chances of adoption -- esp. by enterprises, who are generally the kind of people who even care about this issue in the first place, and who are the people we *want* to attract. Sounds like a win to me. But then again, even if we never do this, it makes no difference to *me* -- the current situation is good enough for *me*. The question is whether or not we want D to be better received by enterprises. T -- I am a consultant. My job is to make your job redundant. -- Mr Tom
Re: Compilation strategy
On 12/17/2012 4:47 PM, deadalnix wrote: On Tuesday, 18 December 2012 at 00:42:13 UTC, Walter Bright wrote: On 12/17/2012 3:03 PM, deadalnix wrote: I know that. I'm not arguing against that. I'm arguing against the fact that this is a blocker. This is a blocker in very few use cases, in fact. I just look at the whole picture here. People needing that are the exception, not the rule. I'm not sure what you mean. A blocker for what? And what prevents us from using a bytecode that loses information? I'd turn that around and ask why have a bytecode? Because it is CTFEable efficiently, without requiring either to recompile the source code or even distribute the source code. I've addressed that issue several times now. I know I'm arguing against scores of billions of dollars invested in JVM bytecode, but the emperor isn't wearing clothes. As long as it is CTFEable, most people will be happy. CTFE needs the type information and AST trees and symbol table. Everything needed for decompilation. You do not need more information than what is in a .di file. Yeah, which is source code. I think you just conceded :-) Java and C# put more info in than that because of runtime reflection (and still, there are tools to strip most of it; no type info, granted, but everything else), something we don't need. There's nothing to be stripped from .class files without rendering them unusable.
Re: Compilation strategy
On 12/17/2012 3:27 PM, Denis Koroskin wrote: On Mon, 17 Dec 2012 13:47:36 -0800, Walter Bright wrote: I've often thought Java bytecode was a complete joke. It doesn't deliver any of its promises. You could tokenize Java source code, run the result through an lzw compressor, and get the equivalent functionality in every way. Not true at all. Bytecode is semi-optimized, I'm not unaware of that, recall I wrote a Java compiler. The "semi-optimized" is generous. The bytecode simply doesn't allow for any significant optimization. easier to manipulate (obfuscate, instrument, etc.), The obfuscators for bytecode are ineffective. It's probably marginally easier to instrument, but adjusting a Java compiler to emit instrumented code is just as easy. JVM/CLR bytecode is shared by many languages (Java, Scala / C#, F#) so you don't need a separate parser for each language, That's true. But since there's a 1:1 correspondence between bytecode and Java, you can just as easily emit Java from your backend. and there is hardware that supports running JVM bytecode on the metal. There's a huge problem with that approach. Remember I said that bytecode can't be more than trivially optimized? In hardware, there's no optimization, so it's going to be doomed to slow execution. Even a trivial JIT will beat it, and if you go beyond basic code generation to using a real optimizer, that will beat the pants off of any hardware bytecode machine. Which is why such machines have not caught on. It's the wrong place in the compilation process to put the hardware. Try doing the same with lzw'd source code. Modern CPU design is heavily influenced by the kind of instructions compilers like to emit. So, in a sense, this is already the case and has been for decades.
(Note the disuse of some instructions in the 8086 that compilers never emit, and their consequent relegation to having as little silicon as possible reserved for them, and the consequent caveats to "never use those instructions, they are terribly slow".)
Re: Compilation strategy
On 12/17/2012 3:11 PM, Rob T wrote: I suspect most file transport protocols already compress the data, so compressing it ourselves probably accomplishes nothing. There are also compressed filesystems, so storing files in a compressed manner likely accomplishes little. Yes, however my understanding is that HTTP-based file transfers are often not compressed despite the protocol specifically supporting the feature. The problem is not with the protocol; it's that some clients and servers simply do not implement the feature or are misconfigured. HTTP, as you know, is very widely used for transferring files. I don't think fixing misconfigured HTTP servers is something D should address. Another thing to consider is using byte code for interpretation; that way D could be used directly in game engines in place of Lua or other scripting methods, or even as a replacement for JavaScript. Of course you know best if this is practical for a language like D, but maybe a subset of D is practical, I don't know. Again, there is zero advantage over using a bytecode for this rather than using source code. Recall that CTFE is an interpreter. (It has some efficiency problems, but that is not related to the file format.) There is no technical reason why tokenized and compressed D source code cannot be interpreted and effectively serve the role of "bytecode". I'll come out and say that bytecode is probably the biggest software misfeature anyone ever set store by :-) Wow, I think that's exactly what we could use! It serves multiple optional use cases all at once! Was there a technical reason for you not getting around to implementing it, or just a lack of time? There always seemed something more important to be doing, and Andrei thought it would be better to put such a capability in rdmd rather than dmd.
Re: Compilation strategy
On Tuesday, 18 December 2012 at 00:42:13 UTC, Walter Bright wrote: On 12/17/2012 3:03 PM, deadalnix wrote: I know that. I'm not arguing against that. I'm arguing against the fact that this is a blocker. This is a blocker in very few use cases, in fact. I just look at the whole picture here. People needing that are the exception, not the rule. I'm not sure what you mean. A blocker for what? And what prevents us from using a bytecode that loses information? I'd turn that around and ask why have a bytecode? Because it is CTFEable efficiently, without requiring either to recompile the source code or even distribute the source code. As long as it is CTFEable, most people will be happy. CTFE needs the type information and AST trees and symbol table. Everything needed for decompilation. You do not need more information than what is in a .di file. Java and C# put more info in than that because of runtime reflection (and still, there are tools to strip most of it; no type info, granted, but everything else), something we don't need.
Re: Compilation strategy
On 12/17/2012 3:03 PM, deadalnix wrote: I know that. I'm not arguing against that. I'm arguing against the fact that this is a blocker. This is a blocker in very few use cases, in fact. I just look at the whole picture here. People needing that are the exception, not the rule. I'm not sure what you mean. A blocker for what? And what prevents us from using a bytecode that loses information? I'd turn that around and ask why have a bytecode? As long as it is CTFEable, most people will be happy. CTFE needs the type information and AST trees and symbol table. Everything needed for decompilation. I know that bytecode has been around since 1995 in its current incarnation, and there's an ingrained assumption that since there's such an extensive ecosystem around it, that there is some advantage to it. But there isn't.
Re: Compilation strategy
On Tue, Dec 18, 2012 at 02:08:55AM +0400, Dmitry Olshansky wrote: [...] > I suspect it's one of prime examples where UNIX philosophy of > combining a bunch of simple (~ dumb) programs together in place of > one more complex program was taken *far* beyond reasonable lengths. > > Having a pipe-line: > preprocessor -> compiler -> (still?) assembler -> linker > > where every program tries hard to know nothing about the previous > ones (and be as simple as possibly can be) is bound to get > inadequate results on many fronts: > - efficiency & scalability > - cross-border error reporting and detection (linker errors? errors > for expanded macro magic?) > - cross-file manipulations (e.g. optimization, see _how_ LTO is done in GCC) > - multiple problems from a loss of information across pipeline* The problem is not so much the structure preprocessor -> compiler -> assembler -> linker; the problem is that these logical stages have been arbitrarily assigned to individual processes residing in their own address space, communicating via files (or pipes, whatever it may be). The fact that they are separate processes is in itself not that big of a problem, but the fact that they reside in their own address space is a big problem, because you cannot pass any information down the chain except through rudimentary OS interfaces like files and pipes. Even that wouldn't have been so bad, if it weren't for the fact that user interface (in the form of text input / object file format) has also been conflated with program interface (the compiler has to produce the input to the assembler, in *text*, and the assembler has to produce object files that do not encode any direct dependency information because that's the standard file format the linker expects). Now consider if we keep the same stages, but each stage is not a separate program but a *library*. 
The code then might look, in greatly simplified form, something like this:

    import libdmd.compiler;
    import libdmd.assembler;
    import libdmd.linker;

    void main(string[] args) {
        // typeof(asmCode) is some arbitrarily complex data
        // structure encoding assembly code, inter-module
        // dependencies, etc.
        auto asmCode = compiler.lex(args)
            .parse()
            .optimize()
            .codegen();
        // Note: no stupid redundant convert to string, parse,
        // convert back to internal representation.
        auto objectCode = assembler.assemble(asmCode);
        // Note: linker has direct access to dependency info,
        // etc., carried over from asmCode -> objectCode.
        auto executable = linker.link(objectCode);
        auto output = File(outfile, "w");
        executable.generate(output);
    }

Note that the types asmCode, objectCode, executable, are arbitrarily complex, and may contain lazy-evaluated data structures, references to on-disk temporary storage (for large projects you can't hold everything in RAM), etc. Dependency information in asmCode is propagated to objectCode, as necessary. The linker has full access to all info the compiler has access to, and can perform inter-module optimization, etc., by accessing information available to the *compiler* front-end, not just some crippled object file format. The root of the current nonsense is that perfectly-fine data structures are arbitrarily required to be flattened into some kind of intermediate form, written to some file (or sent down some pipe), often with loss of information, then read from the other end, interpreted, and reconstituted into other data structures (with incomplete info), then processed. In many cases, information that didn't make it through the channel has to be reconstructed (often imperfectly), and then used. Most of these steps are redundant. If the compiler data structures were already directly available in the first place, none of this baroque dance is necessary.
> *Semantic info on interdependency of symbols in a source file is > destroyed right before the linker and thus each .obj file is > included as a whole or not at all. Thus all C run-times I've seen > _sidestep_ this by writing each function in its own file(!). Even > this alone should have been a clear indication. > > While simplicity (and correspondingly size in memory) of programs > was the king in 70's it's well past due. Nowadays I think is all > about getting highest throughput and more powerful features. [...] Simplicity is good. Simplicity lets you modularize a very complex piece of software (a compiler that converts D source code into executables) into manageable chunks. Simplicity does not require shoe-horning modules into separate programs with separate address spaces with separate (and deficient) input/output formats. The problem isn't with simplicity, the problem is with carrying over the archaic mapping of c
Re: Compilation strategy
On Mon, 17 Dec 2012 13:47:36 -0800, Walter Bright wrote: I've often thought Java bytecode was a complete joke. It doesn't deliver any of its promises. You could tokenize Java source code, run the result through an lzw compressor, and get the equivalent functionality in every way. Not true at all. Bytecode is semi-optimized, easier to manipulate (obfuscate, instrument, etc.), JVM/CLR bytecode is shared by many languages (Java, Scala / C#, F#) so you don't need a separate parser for each language, and there is hardware that supports running JVM bytecode on the metal. Try doing the same with lzw'd source code.
Re: Compilation strategy
On 17.12.2012 23:23, Walter Bright wrote: On 12/17/2012 2:08 PM, Dmitry Olshansky wrote: I really loved the way Turbo Pascal units were made. I wish D went the same route. Object files would then be looked at as minimal and stupid variation of module where symbols are identified by mangling (not plain meta data as (would be) in module) and no source for templates is emitted. +1 I'll bite. How is this superior to D's system? I have never used TP. Just explaining the TP way, not doing comparisons. Each unit (module) is a single file and contains all declarations; there is a separation between the public and implementation parts. Multiple units can be circularly dependent, as long as they depend on each other only through the implementation part. The compiler and IDE are able to extract all the necessary information from a unit file, thus making a single file all that is required for making the compiler happy and avoiding synchronization errors. Like any language using modules, the compiler is pretty fast and uses an included linker optimized for the information stored in the units. Besides the IDE, there are command line utilities that dump the public information of a given unit, as a way for programmers to read the available exported API. Basically not much different from what Java and .NET do, but with a language that by default uses native compilation tooling. -- Paulo
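The "dump the public information of a unit" utility Paulo mentions is easy to picture. A hypothetical sketch in Python, assuming a Turbo Pascal-style source layout with interface/implementation markers (an illustration of the idea only, not TP's actual binary unit format):

```python
def public_api(unit_text):
    """Return only the interface section of a TP-style unit:
    everything between the 'interface' and 'implementation' markers."""
    out, in_iface = [], False
    for line in unit_text.splitlines():
        word = line.strip().lower()
        if word == "interface":
            in_iface = True
            continue
        if word == "implementation":
            break  # everything past here is private
        if in_iface:
            out.append(line)
    return "\n".join(out).strip()
```

The point of the design is that the same file carries both halves, so the "exported API" view can never drift out of sync with the implementation the way a hand-maintained header can.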
Re: Compilation strategy
On Monday, 17 December 2012 at 22:12:01 UTC, Walter Bright wrote: On 12/17/2012 1:53 PM, Rob T wrote: I mentioned in a previous post that we should perhaps focus on making the .di concept more efficient rather than focus on obfuscation. We're not going to do obfuscation, because as I explained such cannot work, and we shouldn't do a disservice to users by pretending it does. There are many ways that *do* work, such as PIMPL, which work today and should be used by any organization wishing to obfuscate their implementation. I agree. I suspect most file transport protocols already compress the data, so compressing it ourselves probably accomplishes nothing. There are also compressed filesystems, so storing files in a compressed manner likely accomplishes little. Yes, however my understanding is that HTTP-based file transfers are often not compressed despite the protocol specifically supporting the feature. The problem is not with the protocol; it's that some clients and servers simply do not implement the feature or are misconfigured. HTTP, as you know, is very widely used for transferring files. Another thing to consider is using byte code for interpretation; that way D could be used directly in game engines in place of Lua or other scripting methods, or even as a replacement for JavaScript. Of course you know best if this is practical for a language like D, but maybe a subset of D is practical, I don't know. I have toyed with the idea many times, however, of having dmd support zip files. Zip files can contain an arbitrary file hierarchy, with individual files in compressed, encrypted, or plaintext at the selection of the zip builder. An entire project, or library, or collection of source modules can be distributed as a zip file, and could be compiled with nothing more than:

    dmd foo.zip

or:

    dmd myfile ThirdPartyLib.zip

and have it work. The advantage here is simply that everything can be contained in one simple file. The concept is simple.
The files in the zip file simply replace the zip file in the command line. So, if foo.zip contains a.d, b/c.d, and d.obj, then:

    dmd xx foo.zip

is equivalent to:

    unzip foo
    dmd xx a.d b/c.d d.obj

P.S. I've also wanted to use .zip files as the .lib file format (!), as the various .lib formats have nothing over .zip files. Wow, I think that's exactly what we could use! It serves multiple optional use cases all at once! Was there a technical reason for you not getting around to implementing it, or just a lack of time? --rt
Re: Compilation strategy
On Monday, 17 December 2012 at 21:36:46 UTC, Walter Bright wrote: On 12/17/2012 12:49 PM, deadalnix wrote: Granted, this is still easier than assembly, but you neglected the fact that Java is rather simple, where D isn't. It is unlikely that an optimized D bytecode can ever be decompiled in a satisfying way. Please listen to me. You have FULL TYPE INFORMATION in the Java bytecode. That is not true for Scala or Clojure. Java bytecode doesn't allow expressing closures and similar concepts. Decompiled Scala bytecode is frankly hard to understand. Java bytecode is nice for decompiling Java, nothing else. You have ZERO, ZERO, ZERO type information in object code. (Well, you might be able to extract some from mangled global symbol names, for C++ and D (not C), if they haven't been stripped.) Do not underestimate what the loss of ALL the type information means to be able to do meaningful decompilation. Please understand that I actually do know what I'm talking about with this stuff. I have written a Java compiler. I know what it emits. I know what's in Java bytecode, and how it is TRIVIALLY reversed back into Java source. I know that. I'm not arguing against that. I'm arguing against the fact that this is a blocker. This is a blocker in very few use cases, in fact. I just look at the whole picture here. People needing that are the exception, not the rule. And what prevents us from using a bytecode that loses information? As long as it is CTFEable, most people will be happy.
Re: Compilation strategy
On Monday, 17 December 2012 at 22:12:01 UTC, Walter Bright wrote: On 12/17/2012 1:53 PM, Rob T wrote: I mentioned in a previous post that we should perhaps focus on making the .di concept more efficient rather than focus on obfuscation. We're not going to do obfuscation, because as I explained such cannot work, and we shouldn't do a disservice to users by pretending it does. There are many ways that *do* work, such as PIMPL, which work today and should be used by any organization wishing to obfuscate their implementation. Shorter file sizes is a potential use case, and you could even allow a distributor of byte code to optionally supply in compressed form that is automatically uncompressed when compiling, although as a trade off that would add on a small compilation performance hit. I suspect most file transport protocols already compress the data, so compressing it ourselves probably accomplishes nothing. There are also compressed filesystems, so storing files in a compressed manner likely accomplishes little. I have toyed with the idea many times, however, of having dmd support zip files. Zip files can contain an arbitrary file hierarchy, with individual files in compressed, encrypted, or plaintext at the selection of the zip builder. An entire project, or library, or collection of source modules can be distributed as a zip file, and could be compiled with nothing more than:

    dmd foo.zip

or:

    dmd myfile ThirdPartyLib.zip

and have it work. The advantage here is simply that everything can be contained in one simple file. The concept is simple. The files in the zip file simply replace the zip file in the command line. So, if foo.zip contains a.d, b/c.d, and d.obj, then:

    dmd xx foo.zip

is equivalent to:

    unzip foo
    dmd xx a.d b/c.d d.obj

P.S. I've also wanted to use .zip files as the .lib file format (!), as the various .lib formats have nothing over .zip files. Yes please.
This is successfully used in Java: JAR files are actually zip files that can contain source code, binary files, resources, documentation, package meta-data, etc. Such an archive can store multiple binaries to support multiple architectures (as on OS X). .NET assemblies accomplish similar goals (although I don't know whether they are zip archives internally as well). D could support both "legacy" C/C++-compatible formats (lib/obj/headers) and this new dlib format.
Re: Compilation strategy
On 12/17/2012 2:08 PM, Dmitry Olshansky wrote: I really loved the way Turbo Pascal units were made. I wish D went the same route. Object files would then be looked at as minimal and stupid variation of module where symbols are identified by mangling (not plain meta data as (would be) in module) and no source for templates is emitted. +1 I'll bite. How is this superior to D's system? I have never used TP. *Semantic info on interdependency of symbols in a source file is destroyed right before the linker and thus each .obj file is included as a whole or not at all. Thus all C run-times I've seen _sidestep_ this by writing each function in its own file(!). Even this alone should have been a clear indication. This is done using COMDATs in C++ and D today.
Re: Compilation strategy
Walter Bright: dmd foo.zip or: dmd myfile ThirdPartyLib.zip and have it work. The advantage here is simply that everything can be contained in one simple file. This was discussed a long time ago (even using a "rock" suffix for those zip files) and it seems a nice idea. Bye, bearophile
Re: Compilation strategy
On 12/17/12, Walter Bright wrote: > I have toyed with the idea many times, however, of having dmd support zip > files. I think such a feature is better suited for RDMD. Then many other compilers would benefit, since RDMD can be used with GDC and LDC.
Re: Compilation strategy
On 12/17/2012 1:53 PM, Rob T wrote: I mentioned in a previous post that we should perhaps focus on making the .di concept more efficient rather than focus on obfuscation. We're not going to do obfuscation, because as I explained such cannot work, and we shouldn't do a disservice to users by pretending it does. There are many ways that *do* work, such as PIMPL, which work today and should be used by any organization wishing to obfuscate their implementation. Shorter file sizes is a potential use case, and you could even allow a distributor of byte code to optionally supply in compressed form that is automatically uncompressed when compiling, although as a trade off that would add on a small compilation performance hit. I suspect most file transport protocols already compress the data, so compressing it ourselves probably accomplishes nothing. There are also compressed filesystems, so storing files in a compressed manner likely accomplishes little. I have toyed with the idea many times, however, of having dmd support zip files. Zip files can contain an arbitrary file hierarchy, with individual files in compressed, encrypted, or plaintext at the selection of the zip builder. An entire project, or library, or collection of source modules can be distributed as a zip file, and could be compiled with nothing more than:

    dmd foo.zip

or:

    dmd myfile ThirdPartyLib.zip

and have it work. The advantage here is simply that everything can be contained in one simple file. The concept is simple. The files in the zip file simply replace the zip file in the command line. So, if foo.zip contains a.d, b/c.d, and d.obj, then:

    dmd xx foo.zip

is equivalent to:

    unzip foo
    dmd xx a.d b/c.d d.obj

P.S. I've also wanted to use .zip files as the .lib file format (!), as the various .lib formats have nothing over .zip files.
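Walter's expansion rule ("the files in the zip file simply replace the zip file in the command line") is mechanical enough to prototype outside the compiler. A sketch of such a driver in Python; the function name and the temp-dir policy are assumptions, not a real dmd feature:

```python
import tempfile
import zipfile
from pathlib import Path

def expand_args(args):
    """Replace each .zip argument with the paths of its extracted
    members, leaving every other argument untouched, so that
    `dmd xx foo.zip` behaves like `unzip foo; dmd xx a.d b/c.d d.obj`."""
    out = []
    for arg in args:
        if not arg.endswith(".zip"):
            out.append(arg)
            continue
        dest = Path(tempfile.mkdtemp(prefix="dzip-"))
        with zipfile.ZipFile(arg) as zf:
            zf.extractall(dest)
            out.extend(str(dest / name) for name in zf.namelist())
    return out

# A wrapper (rdmd, say) could then do:
#   subprocess.run(["dmd"] + expand_args(sys.argv[1:]))
```

This is also a point in favor of doing it in rdmd rather than dmd, as suggested later in the thread: the compiler never needs to know the archive existed.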
Re: Compilation strategy
On 12/18/2012 12:34 AM, Paulo Pinto wrote: On 17.12.2012 21:09, foobar wrote: On Monday, 17 December 2012 at 04:49:46 UTC, Michel Fortin wrote: On 2012-12-17 03:18:45, Walter Bright said: Whether the file format is text or binary does not make any fundamental difference. I too expect the difference in performance to be negligible in binary form if you maintain the same structure. But if you're translating it to another format you can improve the structure to make it faster. If the file had a table of contents (TOC) of publicly visible symbols right at the start, you could read that table of contents alone to fill symbol tables while lazy-loading symbol definitions from the file only when needed. Often, most of the file beyond the TOC wouldn't be needed at all. Having to parse and construct the syntax tree for the whole file incurs many memory allocations in the compiler, which you could avoid if the file was structured for lazy-loading. With a TOC you have very little to read from disk and very little to allocate in memory and that'll make compilation faster. More importantly, if you use only fully-qualified symbol names in the translated form, then you'll be able to lazily load privately imported modules because they'll only be needed when you need the actual definition of a symbol. (Template instantiation might require loading privately imported modules too.) And then you could structure it so a whole library could fit in one file, putting all the TOCs at the start of the same file so it loads from disk in a single read operation (or a couple of *sequential* reads). I'm not sure of the speedup all this would provide, but I'd hazard a guess that it wouldn't be so negligible when compiling a large project incrementally. Implementing any of this in the current front end would be a *lot* of work however. Precisely. That is the correct solution and is also how [turbo?] pascal units (==libs) were implemented *decades ago*.
I'd like to also emphasize the importance of using a *single* encapsulated file. This prevents synchronization hazards that D inherited from the broken C/C++ model. I really loved the way Turbo Pascal units were made. I wish D went the same route. Object files would then be looked at as minimal and stupid variation of module where symbols are identified by mangling (not plain meta data as (would be) in module) and no source for templates is emitted. AFAIK Delphi is able to produce both DCU and OBJ files (and link with them). Dunno what it does with generics (and which kind these are) and how. I really miss it, but at least it has been picked up by Go as well. I still find it strange that many C and C++ developers are unaware that we have had modules since the early 80's. +1 I suspect it's one of the prime examples where the UNIX philosophy of combining a bunch of simple (~ dumb) programs together in place of one more complex program was taken *far* beyond reasonable lengths. Having a pipe-line:

    preprocessor -> compiler -> (still?) assembler -> linker

where every program tries hard to know nothing about the previous ones (and be as simple as it possibly can be) is bound to get inadequate results on many fronts:
- efficiency & scalability
- cross-border error reporting and detection (linker errors? errors for expanded macro magic?)
- cross-file manipulations (e.g. optimization, see _how_ LTO is done in GCC)
- multiple problems from a loss of information across the pipeline*

*Semantic info on interdependency of symbols in a source file is destroyed right before the linker and thus each .obj file is included as a whole or not at all. Thus all C run-times I've seen _sidestep_ this by writing each function in its own file(!). Even this alone should have been a clear indication. While simplicity (and correspondingly size in memory) of programs was the king in the 70's, it's well past due. Nowadays I think it's all about getting the highest throughput and more powerful features. -- Paulo -- Dmitry Olshansky
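Michel Fortin's TOC-first layout quoted above is straightforward to prototype. A minimal sketch, assuming a made-up container format (a magic number, an entry count, name/offset/length records, then the definition blobs); this is an illustration only, not any real D module format:

```python
import io
import struct

MAGIC = b"DMOD"

def write_module(entries):
    """Pack {symbol name: definition blob} into one image with the
    table of contents up front, so a reader can fill its symbol
    table from the TOC alone and fetch definitions lazily."""
    names = sorted(entries)
    toc, body = io.BytesIO(), io.BytesIO()
    for name in names:
        raw = name.encode()
        toc.write(struct.pack("<H", len(raw)) + raw)
        toc.write(struct.pack("<II", body.tell(), len(entries[name])))
        body.write(entries[name])
    return MAGIC + struct.pack("<I", len(names)) + toc.getvalue() + body.getvalue()

def read_toc(data):
    """Parse only the TOC: symbol name -> (offset, length) in the body."""
    assert data[:4] == MAGIC
    (count,) = struct.unpack_from("<I", data, 4)
    pos, toc = 8, {}
    for _ in range(count):
        (nlen,) = struct.unpack_from("<H", data, pos); pos += 2
        name = data[pos:pos + nlen].decode(); pos += nlen
        toc[name] = struct.unpack_from("<II", data, pos); pos += 8
    return toc, pos  # pos marks the start of the body region

def load_symbol(data, toc, body_start, name):
    off, length = toc[name]
    return data[body_start + off : body_start + off + length]
```

A whole library's TOCs could sit at the front of one such file, which is exactly the "single sequential read" property Michel describes.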
Re: Compilation strategy
On Monday, 17 December 2012 at 21:47:36 UTC, Walter Bright wrote: There is no substantive difference between bytecode and source code, as I've been trying to explain. It is not superior in any way (other than being shorter, and hence less costly to transmit over the internet). I mentioned in a previous post that we should perhaps focus on making the .di concept more efficient rather than focus on obfuscation. Shorter file sizes are a potential use case, and you could even allow a distributor of bytecode to optionally supply it in a compressed form that is automatically uncompressed when compiling, although as a trade-off that would add a small compilation performance hit. --rt
Re: Compilation strategy
On 12/17/2012 12:51 PM, deadalnix wrote: On Monday, 17 December 2012 at 09:40:22 UTC, Walter Bright wrote: On 12/17/2012 12:54 AM, deadalnix wrote: More seriously, I understand that in some cases, di are interesting. Mostly if you want to provide a closed source library to be used by 3rd party devs. You're missing another major use - encapsulation and isolation, reducing the dependencies between parts of your system. For such a case, bytecode is a superior solution as it would allow CTFE. There is no substantive difference between bytecode and source code, as I've been trying to explain. It is not superior in any way, (other than being shorter, and hence less costly to transmit over the internet). I've also done precompiled headers for C and C++, which are more or less a binary module importation format. So, I have extensive personal experience with: 1. bytecode modules 2. binary symbolic modules 3. modules as source code I picked (3) for D, based on real experience with other methods of doing it. (3) really is the best solution. I've often thought Java bytecode was a complete joke. It doesn't deliver any of its promises. You could tokenize Java source code, run the result through an lzw compressor, and get the equivalent functionality in every way. And yes, you can do the same with D modules. Tokenize, run through an lzw compressor, and voila! a "binary" module import format that is small, loads fast, and "obfuscated", for whatever little that is worth.
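Walter's "tokenize, then run through a compressor" equivalence is easy to demonstrate. The sketch below is a hypothetical Python illustration: zlib stands in for an LZW compressor (Python's standard library has no LZW codec), and the regex lexer is deliberately crude, nothing like a real D tokenizer.

```python
# A "binary module format" that is nothing more than tokenized source
# run through a general-purpose compressor, as Walter describes.
import re
import zlib

# Crude lexer: identifiers, numbers, two-char operators, any other char.
TOKEN = re.compile(r'[A-Za-z_]\w*|\d+|==|!=|<=|>=|\S')

def tokenize(source: str) -> list[str]:
    """Split source into a flat token stream, dropping whitespace."""
    return TOKEN.findall(source)

def pack(source: str) -> bytes:
    """Tokenize, join with a NUL separator, and compress."""
    return zlib.compress('\x00'.join(tokenize(source)).encode())

def unpack(blob: bytes) -> list[str]:
    """Recover the token stream from the compressed blob."""
    return zlib.decompress(blob).decode().split('\x00')

src = "int toUpper(int c) { return c >= 97 ? c - 32 : c; }"
blob = pack(src)
assert unpack(blob) == tokenize(src)   # lossless round trip
print(len(src.encode()), "->", len(blob), "bytes")
```

As in Walter's description, the result is smaller, loads without re-lexing, and is "obfuscated" only in the sense that whitespace and comments are gone.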
Re: Compilation strategy
On Monday, 17 December 2012 at 12:54:46 UTC, jerro wrote: If we want to allow D to fit into various niche markets overlooked by C++, for added security, encryption could be added, where the person compiling encrypted .di files would have to supply a key. That would work only for certain situations, not for mass distribution, but it may be useful to enough people. I can't imagine a situation where encrypting .di files would make any sense. Such files would be completely useless without the key, so you would have to either distribute the key along with the files or the compiler would need to contain the key. The former obviously makes encryption pointless and you could only make the latter work by attempting to hide the key inside the compiler. The fact that the compiler is open source would make that harder and someone would eventually manage to extract the key in any case. This whole DRM business would also prevent D from ever being added to GCC. Of course open source code would never be encrypted; I was suggesting an entirely optional convenience feature for users of the compiler, not a general method of storing library files or a foolproof method for the mass distribution of hidden content. Having such a feature would allow a company or individual to package up their source code in a way that no one could look at without a specific key. It does not matter if the compiler is open source or not; only a user with the correct key could decrypt the contents, intended or not.
Obviously anyone who has enough skills and the correct key to a specific encrypted package could decrypt the contents of that specific package (and then post it on usenet or bt or in a million+1 other ways). But you would still need access to the key, and you would need access to a tool that decrypts the contents (such as the compiler itself). That's what security is all about: it's simply a set of barriers that make it difficult, but not impossible, to break through. All security systems are breakable, end of story, no debate there, just take a look around you and see for yourself. The difference from packaging in an encrypted archive that is later decrypted and installed for use is that in this case the source code is never left lying around in decrypted form: the source data is decrypted only while it is being compiled, and the decrypted content is immediately and securely discarded afterwards. BTW, for the record I'm no fan of DRM in the general sense, but many companies think they need to lock out prying eyes, and it's not my place to tell them that they should not be worried about it and should fully open up their doors to whatever content they want to distribute. --rt
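The workflow described above, decrypt only in memory while compiling, then discard, can be modeled roughly as below. This is a hypothetical Python sketch; the XOR keystream is a standard-library stand-in for real authenticated encryption and offers no actual protection, which is rather the poster's point about barriers versus guarantees.

```python
# Toy model of an encrypted-.di workflow: the plaintext exists only in
# memory during "compilation", never on disk. NOT real cryptography --
# a hash-based XOR keystream used purely for illustration.
import hashlib

def keystream(key: bytes, n: int) -> bytes:
    """Derive n pseudo-random bytes from the key via chained SHA-256."""
    out = b''
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, 'big')).digest()
        counter += 1
    return out[:n]

def crypt(key: bytes, data: bytes) -> bytes:
    """XOR with a key-derived stream; the same call encrypts and decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

di_source = b"module secret; int hiddenAlgorithm(int x);"
shipped = crypt(b"customer-key", di_source)    # what the vendor distributes
assert shipped != di_source                    # unreadable without the key
recovered = crypt(b"customer-key", shipped)    # decrypted only in memory
assert recovered == di_source                  # then compiled and discarded
```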
Re: Compilation strategy
On 12/17/2012 12:49 PM, deadalnix wrote: Granted, this is still easier than assembly, but you neglected the fact that java is rather simple, where D isn't. It is unlikely that an optimized D bytecode can ever be decompiled in a satisfying way. Please listen to me. You have FULL TYPE INFORMATION in the Java bytecode. You have ZERO, ZERO, ZERO type information in object code. (Well, you might be able to extract some from mangled global symbol names, for C++ and D (not C), if they haven't been stripped.) Do not underestimate what the loss of ALL the type information means to be able to do meaningful decompilation. Please understand that I actually do know what I'm talking about with this stuff. I have written a Java compiler. I know what it emits. I know what's in Java bytecode, and how it is TRIVIALLY reversed back into Java source. The only difference between Java source code and Java bytecode is the latter has local symbol names and comments stripped out. There's a 1:1 correspondence. This is not at all true with object code. (Because .class files have full type information, a Java compiler can "import" either a .java file or a .class file with equal facility.)
Re: Compilation strategy
On Monday, 17 December 2012 at 20:31:08 UTC, Paulo Pinto wrote: But if there was interest, I am sure there could be a way to store the template information in the compiled module, while exposing the required type parameters for the template in the .di file, a la Ada. -- Paulo Maybe the focus should not be on obfuscation directly, but on making the .di packaging system perform better. If library content can be packaged in a more efficient way through the use of D interface files, then at least there's some practical use for it that may one day get implemented. --rt
Re: Compilation strategy
On Monday, 17 December 2012 at 09:40:22 UTC, Walter Bright wrote: On 12/17/2012 12:54 AM, deadalnix wrote: More seriously, I understand that in some cases, di are interesting. Mostly if you want to provide a closed source library to be used by 3rd party devs. You're missing another major use - encapsulation and isolation, reducing the dependencies between parts of your system. For such a case, bytecode is a superior solution as it would allow CTFE. Do you really want to be recompiling the garbage collector for every module you compile? It's not because the gc is closed source that .di files are useful.
Re: Compilation strategy
On Monday, 17 December 2012 at 09:37:48 UTC, Walter Bright wrote: On 12/17/2012 12:55 AM, Paulo Pinto wrote: Assembly is no different than reversing any other type of bytecode: This is simply not true for Java bytecode. About the only thing you lose with Java bytecode are local variable names. Full type information and the variables themselves are intact. It depends on the compiler switches you use. You can strip names, but that obviously impacts reflection capabilities. Also, Java is quite easy to decompile due to the very simple structure of the language, even if in some cases optimization can confuse the decompiler. Try other JVM languages like Clojure, Scala or Groovy: the produced code can hardly be understood except by a specialist. Granted, this is still easier than assembly, but you neglected the fact that Java is rather simple, where D isn't. It is unlikely that optimized D bytecode could ever be decompiled in a satisfying way.
Re: Compilation strategy
On 16.12.2012 23:32, Andrej Mitrovic wrote: On 12/16/12, Paulo Pinto wrote: If modules are used correctly, a .di should be created with the public interface and everything else is already in binary format, thus the compiler is not really parsing everything all the time. A lot of D code tends to be templated code; .di files don't help you in that case. Why not? Ada, Modula-3, Eiffel and the ML languages are just a few examples of languages that support modules and genericity. So clearly there are some ways of doing it. Granted, in the Ada and Modula-3 cases you actually have to define the types when importing a module, so there is already a difference. A second issue is that their generic systems are not as powerful as D's. I think that the main issue is that the majority seems to be OK with having template code in .di files, and that is fine. But if there was interest, I am sure there could be a way to store the template information in the compiled module, while exposing the required type parameters for the template in the .di file, a la Ada. -- Paulo
Re: Compilation strategy
On 17.12.2012 21:09, foobar wrote: On Monday, 17 December 2012 at 04:49:46 UTC, Michel Fortin wrote: On 2012-12-17 03:18:45 +, Walter Bright said: Whether the file format is text or binary does not make any fundamental difference. I too expect the difference in performance to be negligible in binary form if you maintain the same structure. But if you're translating it to another format you can improve the structure to make it faster. If the file had a table of contents (TOC) of publicly visible symbols right at the start, you could read that table of contents alone to fill symbol tables while lazy-loading symbol definitions from the file only when needed. Often, most of the file beyond the TOC wouldn't be needed at all. Having to parse and construct the syntax tree for the whole file incurs many memory allocations in the compiler, which you could avoid if the file was structured for lazy-loading. With a TOC you have very little to read from disk and very little to allocate in memory, and that'll make compilation faster. More importantly, if you use only fully-qualified symbol names in the translated form, then you'll be able to lazily load privately imported modules because they'll only be needed when you need the actual definition of a symbol. (Template instantiation might require loading privately imported modules too.) And then you could structure it so a whole library could fit in one file, putting all the TOCs at the start of the same file so it loads from disk in a single read operation (or a couple of *sequential* reads). I'm not sure of the speedup all this would provide, but I'd hazard a guess that it wouldn't be so negligible when compiling a large project incrementally. Implementing any of this in the current front end would be a *lot* of work, however. Precisely. That is the correct solution and is also how [Turbo?] Pascal units (==libs) were implemented *decades ago*. I'd like to also emphasize the importance of using a *single* encapsulated file. 
This prevents the synchronization hazards that D inherited from the broken C/C++ model. I really miss it, but at least it has been picked up by Go as well. I still find it strange that many C and C++ developers are unaware that we have had modules since the early 80's. -- Paulo
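Michel Fortin's TOC scheme can be sketched in miniature. The container layout below (a length-prefixed TOC mapping symbol names to offsets, followed by the definition bodies) is invented purely for illustration; a real compiler would store parsed declarations rather than raw text.

```python
# Toy binary module: [u32 toc_len][toc "name:offset:len" lines][bodies].
# The TOC is read in one small sequential read; definitions are fetched
# lazily with a seek, so unused symbols never touch the parser.
import io
import struct

def write_module(buf: io.BytesIO, symbols: dict[str, bytes]) -> None:
    """Write all symbol bodies after a TOC mapping name -> (offset, len)."""
    toc_lines, bodies, offset = [], b'', 0
    for name, body in symbols.items():
        toc_lines.append(f"{name}:{offset}:{len(body)}")
        bodies += body
        offset += len(body)
    toc = '\n'.join(toc_lines).encode()
    buf.write(struct.pack('<I', len(toc)) + toc + bodies)

def read_toc(buf: io.BytesIO) -> dict[str, tuple[int, int]]:
    """Read only the TOC; no definition bodies are parsed or allocated."""
    buf.seek(0)
    (toc_len,) = struct.unpack('<I', buf.read(4))
    base = 4 + toc_len
    toc = {}
    for line in buf.read(toc_len).decode().splitlines():
        name, off, ln = line.rsplit(':', 2)
        toc[name] = (base + int(off), int(ln))
    return toc

def load_symbol(buf: io.BytesIO, toc: dict, name: str) -> bytes:
    """Lazily fetch a single definition by seeking straight to it."""
    off, ln = toc[name]
    buf.seek(off)
    return buf.read(ln)

buf = io.BytesIO()
write_module(buf, {"toUpper": b"dchar toUpper(dchar c) {...}",
                   "toLower": b"dchar toLower(dchar c) {...}"})
toc = read_toc(buf)
assert load_symbol(buf, toc, "toLower") == b"dchar toLower(dchar c) {...}"
```

Packing many modules' TOCs at the front of one library file, as suggested above, is the same idea with one more level of indirection.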
Re: Compilation strategy
On Monday, 17 December 2012 at 04:49:46 UTC, Michel Fortin wrote: On 2012-12-17 03:18:45 +, Walter Bright said: Whether the file format is text or binary does not make any fundamental difference. I too expect the difference in performance to be negligible in binary form if you maintain the same structure. But if you're translating it to another format you can improve the structure to make it faster. If the file had a table of contents (TOC) of publicly visible symbols right at the start, you could read that table of contents alone to fill symbol tables while lazy-loading symbol definitions from the file only when needed. Often, most of the file beyond the TOC wouldn't be needed at all. Having to parse and construct the syntax tree for the whole file incurs many memory allocations in the compiler, which you could avoid if the file was structured for lazy-loading. With a TOC you have very little to read from disk and very little to allocate in memory, and that'll make compilation faster. More importantly, if you use only fully-qualified symbol names in the translated form, then you'll be able to lazily load privately imported modules because they'll only be needed when you need the actual definition of a symbol. (Template instantiation might require loading privately imported modules too.) And then you could structure it so a whole library could fit in one file, putting all the TOCs at the start of the same file so it loads from disk in a single read operation (or a couple of *sequential* reads). I'm not sure of the speedup all this would provide, but I'd hazard a guess that it wouldn't be so negligible when compiling a large project incrementally. Implementing any of this in the current front end would be a *lot* of work, however. Precisely. That is the correct solution and is also how [Turbo?] Pascal units (==libs) were implemented *decades ago*. I'd like to also emphasize the importance of using a *single* encapsulated file. 
This prevents the synchronization hazards that D inherited from the broken C/C++ model.
Re: Compilation strategy
On 2012-12-17 16:01, Walter Bright wrote: Yup. I'd be very surprised if they were based on decompiled Windows executables. Not only that, I didn't say decompiling by hand was impossible. I repeatedly said that it can be done by an expert with a lot of patience. But not automatically. Java .class files can be done automatically with free tools. Fair enough. -- /Jacob Carlborg
Re: Compilation strategy
On 12/17/2012 3:02 AM, mist wrote: AFAIK those are more like Windows API & ABI reverse engineered and reimplemented and that is a huge difference. Yup. I'd be very surprised if they were based on decompiled Windows executables. Not only that, I didn't say decompiling by hand was impossible. I repeatedly said that it can be done by an expert with a lot of patience. But not automatically. Java .class files can be done automatically with free tools.
Re: Compilation strategy
On 12/17/2012 4:38 AM, Jacob Carlborg wrote:> On 2012-12-17 10:58, Walter Bright wrote: > >> Google "convert object file to C" > > A few seconds on Google resulted in this: > > http://www.hex-rays.com/products/decompiler/index.shtml > hex-rays is an interactive tool. Its "decompile" produces things like this:

v81 = 9;
v63 = *(_DWORD *)(v62 + 88);
if ( v63 )
{
    v64 = *(int (__cdecl **)(_DWORD, _DWORD, _DWORD, _DWORD, _DWORD))(v63 + 24);
    if ( v64 )
        v62 = v64(v62, v1, *(_DWORD *)(v3 + 16), *(_DWORD *)(v3 + 40), bstrString);
}

It has wired in some recognition of patterns that call standard functions like strcpy and strlen, but as far as I can see not much else. It's interactive in that you have to supply the interpretation. It's pretty simple to decompile asm by hand into the above, but the work is only just beginning. For example, what is v3+16? Some struct member? Note that there is no type information in the hex-rays output.
Re: Compilation strategy
On 2012-12-17 12:20, eles wrote: I don't know about such frameworks, but the idea is that these kinds of files should be handled by the compiler, not by the operating system. They are not meant to be applications, but libraries. They are handled by the compiler. GCC has the -framework flag. https://developer.apple.com/library/mac/#documentation/MacOSX/Conceptual/BPFrameworks/Concepts/WhatAreFrameworks.html The Finder also knows about these frameworks and bundles and treats them as a single file. -- /Jacob Carlborg
Re: Compilation strategy
It's not as if phobos would be distributed that way. And even if it was, then there'd be an uproar and a fork of the project. I don't think that the FSF would be too happy about adding a front end with DRM support to GCC, even if no encrypted libraries would be added along with it. Of course a fork without DRM support could still be added to GCC, but if support for DRM libraries became part of D, then this would cause problems when some people chose to actually use this feature with their closed source libraries and those couldn't be used with GDC.
Re: Compilation strategy
On 17 December 2012 12:54, jerro wrote: > If we want to allow D to fit into various niche markets overlooked by C++, >> for added security, encryption could be added, where the person compiling >> encrypted .di files would have to supply a key. That would work only for >> certain situations, not for mass distribution, but it may be useful to >> enough people. >> > > I can't imagine a situation where encrypting .di files would make any > sense. Such files would be completely useless without the key, so you would > have to either distribute the key along with the files or the compiler > would need to contain the key. The former obviously makes encryption > pointless and you could only make the latter work by attempting to hide the > key inside the compiler. The fact that the compiler is open source would > make that harder and someone would eventually manage to extract the key in > any case. This whole DRM business would also prevent D from ever being > added to GCC. > It's not as if phobos would be distributed that way. And even if it was, then there'd be an uproar and a fork of the project. -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Re: Compilation strategy
If we want to allow D to fit into various niche markets overlooked by C++, for added security, encryption could be added, where the person compiling encrypted .di files would have to supply a key. That would work only for certain situations, not for mass distribution, but it may be useful to enough people. I can't imagine a situation where encrypting .di files would make any sense. Such files would be completely useless without the key, so you would have to either distribute the key along with the files or the compiler would need to contain the key. The former obviously makes encryption pointless and you could only make the latter work by attempting to hide the key inside the compiler. The fact that the compiler is open source would make that harder and someone would eventually manage to extract the key in any case. This whole DRM business would also prevent D from ever being added to GCC.
Re: Compilation strategy
On 2012-12-17 10:58, Walter Bright wrote: Google "convert object file to C" A few seconds on Google resulted in this: http://www.hex-rays.com/products/decompiler/index.shtml -- /Jacob Carlborg
Re: Compilation strategy
On Monday, 17 December 2012 at 09:58:28 UTC, Walter Bright wrote: On 12/17/2012 1:35 AM, Paulo Pinto wrote: It suffices to get the general algorithm behind the code, and that is impossible to hide, unless the developer resorts to cryptography. I'll say again, with enough effort, an expert *can* decompile object files by hand. You can't make a tool to do that for you, though. It can also be pretty damned challenging to figure out the algorithm used in a bit of non-trivial assembler after it's gone through a modern compiler optimizer. I know nobody here wants to believe me, but it is trivial to automatically turn Java bytecode back into source code. Google "convert .class file to .java": http://java.decompiler.free.fr/ Now try: Google "convert object file to C" If you don't believe me, a guy who's been working on C compilers for 30 years, and who also wrote a Java compiler, that should be a helpful data point. Of course I believe you and respect your experience. The point I was trying to make is that if someone really wants your code, they will get it, even if that means reading assembly instructions manually. At one company I used to work for, we rewrote the TCL parser to read encrypted files to avoid delivering text to the customer, hoping that it would be enough to deter most people. -- Paulo
Re: Compilation strategy
Sounds a lot like frameworks and other types of bundles on Mac OS X. A framework is a folder, with the .framework extension, containing a dynamic library, header files and all other necessary resource files like images and so on. I don't know about such frameworks, but the idea is that these kinds of files should be handled by the compiler, not by the operating system. They are not meant to be applications, but libraries.
Re: Compilation strategy
AFAIK those are more like the Windows API & ABI reverse engineered and reimplemented, and that is a huge difference. On Monday, 17 December 2012 at 10:01:35 UTC, Jacob Carlborg wrote: On 2012-12-17 09:21, Walter Bright wrote: I know what I'm talking about with this. The only time they get reverse engineered is when somebody really really REALLY wants to do it, an expert is necessary to do the job, and it's completely impractical for larger sets of files. You cannot build a tool to do it, it must be done by hand, line by line. It's the proverbial turning of hamburger back into a cow. Ever heard of Wine or ReactOS? It's basically Windows reverse engineered.
Re: Compilation strategy
On 2012-12-17 10:13, eles wrote: WRT all the opinions above (i.e. binary vs text, what to put in, etc.) I had some reflections on that some time ago: how about bundling a "header" file (that would be the .di file) and a binary file (the compiled .d file, that is the .obj file) into a single .zip (albeit with another extension) that will be recognized and processed by the D compiler (let's name that file a .dobj). The idea may seem a bit crazy, but consider the following:
- the standard .zip format could be used by a user of that object/library to learn the interface of the functions provided by the object (just like a C header file)
- if he's a power user, he can simply extract the .zip/.dobj, modify the included header (adding comments, for example), then archive it back and present the compiler a "fresh" .dobj/library file
Sounds a lot like frameworks and other types of bundles on Mac OS X. A framework is a folder, with the .framework extension, containing a dynamic library, header files and all other necessary resource files like images and so on. The responsibility of maintaining the .obj and the header in sync would lie with the compiler, or with the power user if the latter edits it manually. More, IDEs could simply extract the relevant header information from the .zip archive and use it for code completion, documentation and so on. Basically, this would be like bundling a .h file with the corresponding .obj file (if we speak C++), all that under a transparent format. The code is hidden and obfuscated, just like in a standard library (think -lstdc++ vs ). The use of a single file greatly facilitates synchronization, while the use of the standard .zip format allows a plethora of tools to manually tune the file (if desired). This can be extended also to entire .dlibs (that is, archives of .dobjs), which can become self-documenting that way. I kind of dreamt about that since programming in C++, always needing to have the headers and the libs with me. Why not include the headers in the lib, in a transparent and manually readable/editable format? A checksum could also guarantee that the header information and the binary information are in sync inside the .zip archive. What do you think? In general I think it's better to have a package manager handle this. -- /Jacob Carlborg
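The .dobj idea is straightforward to prototype on top of an ordinary zip. The sketch below bundles a header and an object blob together with the checksum eles mentions, so a tool can verify the two halves are still in sync; the file names and manifest format are invented for the example.

```python
# A ".dobj" as a plain zip: header + object code side by side, plus a
# manifest of SHA-256 checksums so header/binary drift can be detected.
import hashlib
import io
import zipfile

def make_dobj(di_text: bytes, obj_bytes: bytes) -> bytes:
    """Bundle header and object code into one zip with a checksum manifest."""
    manifest = (f"di:{hashlib.sha256(di_text).hexdigest()}\n"
                f"obj:{hashlib.sha256(obj_bytes).hexdigest()}\n")
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as z:
        z.writestr('module.di', di_text)
        z.writestr('module.obj', obj_bytes)
        z.writestr('MANIFEST', manifest)
    return buf.getvalue()

def in_sync(dobj: bytes) -> bool:
    """Check that the stored checksums still match the stored contents."""
    with zipfile.ZipFile(io.BytesIO(dobj)) as z:
        stored = dict(line.split(':', 1)
                      for line in z.read('MANIFEST').decode().split())
        return (stored['di'] == hashlib.sha256(z.read('module.di')).hexdigest()
                and stored['obj'] == hashlib.sha256(z.read('module.obj')).hexdigest())

bundle = make_dobj(b"int foo();", b"\x55\x89\xe5")
assert in_sync(bundle)
```

Because it is a standard zip, the power-user workflow from the post (extract, annotate the header, re-archive) works with any archiver, and re-running the checksum tool flags whether the edit broke synchronization.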
Re: Compilation strategy
On 17 December 2012 09:29, Walter Bright wrote: > On 12/17/2012 1:15 AM, Paulo Pinto wrote: > >> http://www.hopperapp.com/ >> >> I really like the way it generates pseudo-code and basic block graphs out >> of >> instruction sequences. >> > > I looked at their examples. Sorry, that's just step one of reverse > engineering an object file. It's a loong way from turning it into > source code. > > For example, consider an optimizer that puts variables int x, class c, and > pointer p all in register EBX. Figure that one out programmatically. Or the > result of a CTFE calculation. Or a template after it's been expanded and > inlined. > Right, there is practically zero chance of being able to come up with 100% identical D code from an object dump / assembly code. Possibly with the exception of a few *very* simple cases (hello world!). However it looks like you might just be able to decode it into a bastardised C version. I can't see hopperapp being very practical beyond small stuff, though... Regards, -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Re: Compilation strategy
On 2012-12-17 09:21, Walter Bright wrote: I know what I'm talking about with this. The only time they get reverse engineered is when somebody really really REALLY wants to do it, an expert is necessary to do the job, and it's completely impractical for larger sets of files. You cannot build a tool to do it, it must be done by hand, line by line. It's the proverbial turning of hamburger back into a cow. Ever heard of Wine or ReactOS? It's basically Windows reverse engineered. -- /Jacob Carlborg
Re: Compilation strategy
On 2012-12-17 09:19, deadalnix wrote: I can't stop myself laughing at people that may think any business can be based on Java, PHP or C#. That is a mere dream ! Such technology will simply never get used in companies, because bytecode can be decoded ! Yet there are a lot of businesses that are based on these languages. -- /Jacob Carlborg
Re: Compilation strategy
On 12/17/2012 1:35 AM, Paulo Pinto wrote: It suffices to get the general algorithm behind the code, and that is impossible to hide, unless the developer resorts to cryptography. I'll say again, with enough effort, an expert *can* decompile object files by hand. You can't make a tool to do that for you, though. It can also be pretty damned challenging to figure out the algorithm used in a bit of non-trivial assembler after it's gone through a modern compiler optimizer. I know nobody here wants to believe me, but it is trivial to automatically turn Java bytecode back into source code. Google "convert .class file to .java": http://java.decompiler.free.fr/ Now try: Google "convert object file to C" If you don't believe me, a guy who's been working on C compilers for 30 years, and who also wrote a Java compiler, that should be a helpful data point.
Re: Compilation strategy
On 12/17/2012 1:45 AM, Paulo Pinto wrote: Pencil and paper? Yes, as I wrote, object files can be reverse engineered, instruction by instruction, by an expert with pencil and paper. You can't make a tool to do it automatically. You *can* make such a tool for Java bytecode files, and such free tools appeared right after Java was initially released.