On 9/28/2011 7:43 PM, Erick Lavoie wrote:
I've found the vision of a simple, open and evolutionary adaptable programming language substrate, as described in Albert [1], tantalizing. I especially like the idea of dynamically evolving a language 'from within' a fluid substrate. I am left wondering the extent to which this vision was realized and its actual benefits.

The commit log of idst [2] shows that work was done up until 2009 (minus a minor commit in 2010). The 2009 and 2010 NSF reports make no mention of the COLA system; the latter report mentions Nothing as the target for all the other DSLs. No comment has been made on the language(s) used to explore the VMs' design. The git mirrors for idst and ocean are not accessible on Michael FIG's website anymore. The latest papers citing [1] date from 2009 [3]. Based on Alan's comment that "it is not an incremental project [...]. We actually throw away much of our code and rewrite with new designs pretty often" [4], I would assume that there is no interest in pursuing the COLA project anymore.


can't say myself; I haven't really looked much into it...

I personally don't believe much in "throwing away code and starting clean", but then I am more of the typical "throw together a mountain of code and try to beat it into working" type of person.

the funny thing is that sometimes work makes code smaller, and sometimes bigger.
a while ago, I was getting negative kloc (I was trying to better integrate a lot of my code, rewriting and factoring out a lot of stuff in relevant areas, ...).

however, this trend seems to have reversed again and now I am back into positive-kloc land.


<snip>
nothing I can comment on.
probably better left for people who have worked on this project.


In the present, I am interested in developing a COLA-like system that could serve as an exploration vehicle for research on high-performance meta-circular VMs for dynamic languages. The main objective would be to drastically reduce the amount of effort needed to test new ideas about object representations, calling conventions, memory management, compiler optimizations, etc. and pave the way for more dynamic optimizations.

IMO, a good way to approach this would be the development of general purpose modular components.

personally, I prefer to focus more on representation than on implementation in this regard: one can have concrete and reasonably "standardized" representations for various sorts of data.

this way, one can plug together/replace most parts which deal with these representations.

for example, one can have an assembler:
on x86/x86-64, it accepts, say, NASM style syntax;
on ARM/Thumb, it accepts ARM-style ASM syntax.

then one can be like: can your code produce ASM? well, now you can use the assembler. (there is a "binary interface", which is theoretically a bit faster, but mostly I use the textual interface, as it is IMO nicer and generally "plenty fast enough").
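
as a rough sketch of the sort of textual interface I mean (the names and signatures here are hypothetical, purely for illustration, not any actual API):

#include <stdio.h>
#include <stddef.h>

/* hypothetical interface: assemble NASM-style source text into a
   buffer of machine code; returns NULL on a parse error.
   (stubbed out here; the real body would be the whole assembler.) */
void *Asm_AssembleString(const char *src, size_t *rsz)
{
    (void)src; *rsz = 0;
    return NULL; /* stub */
}

/* typical usage: any code which can produce ASM text can use it */
void example(void)
{
    size_t sz;
    void *code = Asm_AssembleString(
        "my_add:\n"
        "mov eax, [esp+4]\n"
        "add eax, [esp+8]\n"
        "ret\n", &sz);
    if (!code) { fprintf(stderr, "assemble failed\n"); return; }
    /* ... mark the buffer executable, look up 'my_add', call it ... */
}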

also good would be a nice "general-purpose native codegen", but sadly this is a little more complex: there is not really any good IR which exists between the IL and the ASM (probably this is sort of the whole point of an IL/IR...). also, my old codegen is a bit complex/nasty, and I have yet to write a new one to replace it.


other projects have gone in the other direction, focusing mostly on implementation. for example, LLVM: nearly everything is built on OO and classes. want to build something? subclass and override the relevant classes.

this is a strategy, but I don't agree with its use in the absence of a solid set of canonical data representations. an interface can be nice, but IMO shouldn't replace the data.


either way, supporting the ability to plug in components or interfaces makes sense. an example: need to add new operations or forms to the IR? well, plug in the relevant handlers.

my assembler, linker, GC, ... also allow plugging in interfaces (generally via vtable structs), hence one can add new pseudo-ops to the assembler or "pseudo-macros" to the assembler's preprocessor, or add new linker handlers for resolving symbols.

there is actually a system partly built around this: essentially there are pseudo-ops which call back into special code-generation handlers, at link-time, to allow "extending" the basic ASM syntax. this was used, for example, to provide an ASM-level interface to the OO facilities.
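
a hypothetical sketch of the sort of vtable-struct plugin interface described above (all names are invented for illustration):

typedef struct AsmPseudoOp_s AsmPseudoOp;
struct AsmPseudoOp_s {
    const char *name;        /* pseudo-op mnemonic, e.g. "method" */

    /* assemble-time: parse the args, emit bytes and/or relocations */
    int (*assemble)(void *asmctx, const char *args);

    /* link-time callback: patch the image once symbols are known
       (the hook used, e.g., for the ASM-level OO interface) */
    int (*link)(void *lnkctx, void *image);

    AsmPseudoOp *next;       /* the assembler chains these in a list */
};

void Asm_RegisterPseudoOp(AsmPseudoOp *op);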

in the GC, it is possible to register handlers for implementing new object-specific behaviors, or extending the GC to cover user-managed memory areas (to allow tracing through them, getting dynamic type names, ...).
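
again as a hypothetical sketch (invented names), the GC side could look something like:

typedef struct GcTypeVTable_s {
    const char *name;              /* dynamic type name, if queried */
    void (*mark)(void *obj);       /* trace references held by obj */
    void (*finalize)(void *obj);   /* object-specific cleanup */
} GcTypeVTable;

void GC_RegisterType(GcTypeVTable *vt);

/* cover a user-managed region, so the GC will trace through it */
void GC_AddTracedRegion(void *base, void *end, GcTypeVTable *vt);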

also, there is a dynamic type-system built largely around user-defined types, where code creating its own types on-the-fly is commonplace (this is in contrast to more traditional "tagged reference and magic-number" systems, which, although potentially more efficient, are a bit more painful to extend). again, many operations are handled via user-defined vtables.
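
in the same spirit (a sketch with invented names, not the actual API), on-the-fly type creation might look like:

#include <stddef.h>

/* operations dispatch through the user-supplied vtable, rather than
   through a fixed set of built-in type tags */
typedef struct DyTypeVTable_s {
    const char *name;                  /* e.g. "myapp_foo_t" */
    void  (*free)(void *obj);
    void *(*getindex)(void *obj, int idx);
    void  (*print)(void *obj);
} DyTypeVTable;

void  dyRegisterType(DyTypeVTable *vt);        /* add a new type */
void *dyAlloc(const char *name, size_t sz);    /* make an instance */
const char *dyTypeName(void *obj);             /* query its type */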


actually, there are a few places internally where things have been optimized via "plug logic": essentially, the general-purpose logic for a dynamic operation would be fairly slow/costly, so the code "cheats" by using cached function-pointers (this is used, for example, for object vtables, which store both the method handle and a cached function-pointer to the handler for calling said method).
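
the trick looks roughly like this (structure and names are illustrative, not the actual implementation):

typedef void *(*VtCallFn)(void *self, void *method, void **args, int nargs);

typedef struct VtMethod_s {
    void     *method;   /* the abstract method handle */
    VtCallFn  call;     /* cached call handler; NULL until resolved */
} VtMethod;

/* the slow general-purpose lookup, implemented elsewhere in the VM */
VtCallFn vtResolveCall(void *method);

void *vtCallMethod(void *self, VtMethod *vm, void **args, int nargs)
{
    if (!vm->call)                      /* slow path, taken once */
        vm->call = vtResolveCall(vm->method);
    return vm->call(self, vm->method, args, nargs);  /* fast path */
}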

another nifty hack (AFAICT reasonably unique among dynamic type-systems) is support for pass-by-value types (though it does require the use of special reference-handling functions). this is supported by the object-system to allow "value classes" (and faking of RAII semantics).

oh yeah, and one can "optimize" dynamically-typed script code via type-inference (essentially making the "variant" type potentially also sort of like an "auto" type).


FWIW, it all also focuses a lot on C interfacing (facilitated by using tools to mine metadata databases from C code and headers), allowing for semi-transparent C interfacing, as well as some amount of C-level reflection (what is the offset and type of 'this' field within a struct? what is the signature of 'this' function? call 'this' C function pointer with a dynamically-typed list or a string as the argument list, ...).
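
a hypothetical sketch of what such queries over the mined metadata could look like (names invented; the database itself would come from the offline header-mining tools):

#include <stddef.h>

long        cMetaFieldOffset(const char *sname, const char *fname);
const char *cMetaFieldType  (const char *sname, const char *fname);
const char *cMetaFuncSig    (const char *fname);  /* signature string */

/* call a C function pointer, marshalling a dynamically-typed
   argument list according to the mined signature */
void *cMetaApply(void *fptr, const char *sig, void **args, int nargs);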

although it would be nice to offer similar capabilities for C++-level code, this is non-trivial at the moment (C++ is a bit more complex than C at both the syntax and ABI levels).


however, my VM is, sadly, far from clean or simple...
it is likely that people wanting to work on or extend it (or build custom languages) would have a bit of a steep learning curve and a lot of work to do (especially given its near complete lack of adequate documentation, as most of the "documentation" is the source itself...).


I would like the system to:
1. allow interactive and safe development of multiple natively-running object representations and dynamic compilers with full tool support (debugger, profiler, inspector)
2. easily migrate the system to a different implementation language


these are likely non-trivial goals.
any general-purpose solution is likely to be huge and complicated, and thus, by extension, not terribly amenable to porting (try rewriting a non-trivial amount of C code into Java or similar, and feel the pain...). one can build a custom HLL and proceed to write everything in it, but then one is stuck with this HLL (even if the underlying target moves from C/native to Java/JVM or similar), as by definition, the HLL is the implementation language.


in my case, I am mostly writing everything in C, and as for if/when one can port away from C?... who knows.

however, with sufficient modularization and defined data representations, it is possible one can have several implementations in several languages.


a BGBScript implementation in BGBScript would be amusing, but IMO not a terribly practical goal, especially since BS is still a moving target: it has gone from being JS-like a year ago to being more AS3-like (package/class/property/... syntax), with some newer features more reminiscent of C++ (value-classes and copy-constructors, also pointers/pointer-operations/..., for example).

meanwhile, it still suffers from a lack of adequate testing and overall poor performance (much of its core is still largely dynamically-typed).

but, after all, I intended the language for high-level scripting, not as a serious implementation language.

so, it matters much more that I can quickly "eval" stuff and load code from source files than that I have solid reliability or performance, or the ability to implement the language in itself.


The first property would serve to bring the benefits of a live programming environment to VM development (as is possible in Squeak, through simulation), and the second would serve a) to facilitate the dissemination of the implementation techniques (including meta-circular VM construction) in existing language communities (Python, JavaScript, Scheme, etc.) and b) to obtain expressive and performant notation(s) for VM research without having to start from scratch each time.


would be nice.

current thinking (levels of abstraction):
ASM for the ASM level;
register-machine model IR (optional; maybe not SSA form, as this is IMO overly painful; potentially CPS form is an option as well);
abstract stack-machine IL (stack machines are easy to work with/target; note: this has almost no real relation to the CPU/machine stack, apart from the word 'stack', and is likely targeted via "bytecode");
S-Expression or XML based AST syntax (internally, lists or DOM nodes);
...
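
to illustrate (the concrete notations here are just made-up examples, not fixed formats), the same statement at each level:

c = a + b;   /* source level; AST: (assign c (+ a b)) */

/* stack IL:     lload a; lload b; add; lstore c                  */
/* register IR:  t0=load a; t1=load b; t2=add t0,t1; store c,t2   */
/* x86 ASM:      mov eax,[a]; add eax,[b]; mov [c],eax            */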

a person can then "plug in" to whatever levels they are interested in (invoking parser, then working with the resultant ASTs, feeding custom ASTs into the VM, ...).

this may seem like a lot, but sadly this level of complexity and layering may well be needed for sake of generality.

not that it necessarily requires a multi-Mloc compiler project though (I am doing ok at around 400 kloc for all this, for now, which seems to be "fairly average" for JavaScript-style VMs).


so, why a register-based IR?
this is because, generally, the most "workable" way to target a frame-based ABI is by treating the stack-frame as a collection of locals and/or temporary variables; also, some calling conventions (such as SysV/AMD64) want lots of stuff passed in CPU registers, which essentially makes a big ugly mess of direct (physical) stack usage (generally, a stack-based IL may "fold" its stack operations into operations over temporaries).
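
the "folding" in question can be sketched like so (names invented; the compile-time stack holds IR temporary numbers rather than values, so each stack-IL op just allocates a temp and emits a 3-address IR op):

void irEmit3(const char *op, int dst, int srca, int srcb);

int stk[64], stkpos;    /* compile-time stack of temp IDs */
int ntemp;              /* temp number allocator */

void ilCompileAdd(void)
{
    int b = stk[--stkpos];      /* pop the two source temps */
    int a = stk[--stkpos];
    int t = ntemp++;            /* allocate the destination temp */
    irEmit3("add", t, a, b);    /* emit: Tt = add Ta, Tb */
    stk[stkpos++] = t;          /* push the result temp */
}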

why not SSA form? because, IMO, SSA form is very painful to produce, can be generated internally if needed, and its utility is mostly limited to micro-optimizing.

CPS is also nifty, but mixes poorly with typical ABIs/calling-conventions (SysV/AMD64, cdecl, and stdcall, for example...), and so would necessarily require its own ABI (and thunking for calls to/from C-land). so, doesn't really seem worthwhile.

other possibilities include threaded-stacks and split-stacks (Go uses split stacks), which are potentially more "friendly" to standard code (in a split stack, one allocates small stack-frames, and "jumps" to a new stack segment if their call-frame will not fit within the existing stack segment). however, for calls into C-land, this may mean having to ensure a C-friendly stack-segment (potentially costly). a downside: this still does not allow for a good/easy way to implement full continuations.


why then a stack-based IL?
because directly targeting a register-based IR is far more painful than it needs to be, and also because, IMO, a register-machine IR is not the most appropriate level of abstraction for an IL. essentially, a stack-based IL can gloss over the nasty internals of a register IR in much the same way as a high-level syntax glosses over AST-level nastiness (and potentially allow the IR to better express/exploit target-specific properties).

also, the IL->IR stage provides a good place to put one's type-inference/specialization logic: at the IL level, one simply "adds two items", while at the IR level one knows that, say, two integers are being added, from T9 and T13, with the result going into T17.
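
continuing the earlier sketch (names again invented for illustration), this is roughly where the specialization would happen: the same IL-level "add" emits either a typed IR op or a generic dynamic one, depending on the inferred types of the source temps:

enum { TY_VARIANT, TY_INT, TY_DOUBLE };

int  tempType(int t);            /* inferred type of a temp */
void setTempType(int t, int ty);
int  popTemp(void);
void pushTemp(int t);
int  newTemp(void);
void irEmit3(const char *op, int dst, int srca, int srcb);

void ilCompileAddTyped(void)
{
    int b = popTemp(), a = popTemp();
    int t = newTemp();
    if (tempType(a) == TY_INT && tempType(b) == TY_INT) {
        setTempType(t, TY_INT);
        irEmit3("add.i", t, a, b);    /* known integer add */
    } else {
        setTempType(t, TY_VARIANT);
        irEmit3("add.dyn", t, a, b);  /* generic/dynamic add */
    }
    pushTemp(t);
}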

keeping types abstract at the IL level also greatly simplifies targeting it, as then the language front end need not worry as much about the type of every value (the back-end will work this out).

also: one can directly interpret the stack-based IL as well (treating it like a dynamically-typed bytecode).
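
a minimal sketch of this (the opcodes and variant type are invented for illustration):

typedef struct {
    int tag;                                /* TY_INT, TY_DOUBLE, ... */
    union { int i; double d; void *p; } v;
} Variant;

enum { TY_INT };
enum { OP_PUSHI, OP_ADD, OP_RET };

Variant ilInterp(unsigned char *ip)
{
    Variant stk[256];
    int sp = 0;
    for (;;) {
        switch (*ip++) {
        case OP_PUSHI:                      /* push an int literal */
            stk[sp].tag = TY_INT;
            stk[sp++].v.i = *ip++;
            break;
        case OP_ADD:                        /* pop 2, push the sum */
            sp--;
            /* a real version would dispatch on the tags here */
            stk[sp-1].v.i += stk[sp].v.i;
            break;
        case OP_RET:
            return stk[--sp];
        }
    }
}

so, e.g., the sequence { OP_PUSHI,2, OP_PUSHI,3, OP_ADD, OP_RET } returns a Variant holding 5.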


why not specialize types early (in the parser or language frontend)?
IMO, this does bad things to both flexibility and language semantics.

there may be many things which are very relevant to the backend, which may not be known "effectively" in the frontend without it becoming overly specialized.

at the language level, it would become a tradeoff: one can either have the types in direct visibility (all declarations visible, with mandatory types), or declare the variable as an abstract/dynamic type and accept that it is necessarily slower.

if one leaves the types to the backend, it may be able to exploit things like indirect visibility, and also infer types for dynamically-typed variables which are only ever used with a single type.

this allows the front-end language to be a little "softer", and also to express constructions or features which would not otherwise be reasonable/possible with a "direct visibility only" model of static typing (such as in Java, which, by the design of the language and the JVM, can't, for example, just directly "import" C code and leave it to the VM to figure out how to do all the glue).


why S-Expressions or XML for ASTs?...
because IMO, this "just makes sense".

(vs, say, objects or structs, which IMO have too many drawbacks relative to their merits here, unless one is really worried about micro-optimizing their compiler frontend or something...).

S-Exps vs XML is itself its own set of tradeoffs, and neither seems "clearly better".

S-Exps are easy to work with and allow compact code; however, they suffer more from "flexibility" issues: structural changes to S-Exp-related code far more often require going and altering other code, or creating redundant special-forms and lots of special-cases mostly to deal with minor variations.

XML, being more of a tagging-structure (and having key/value attributes), just works a bit nicer here; however, it is somewhat more awkward to work with, and tends to eat through a lot more memory (a DOM node may well be a good deal more expensive than a cons-cell).
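
for example, the AST for "x=y+1;" in both forms (the concrete tag names here are invented):

(assign (ref x) (binary + (ref y) (int 1)))

<assign>
  <ref name="x"/>
  <binary op="+">
    <ref name="y"/>
    <int value="1"/>
  </binary>
</assign>

the attributes are where the flexibility shows up: adding a new attribute to a node generally doesn't disturb existing code walking the children, whereas adding a positional element to an S-Exp form tends to break whatever code was matching on the old shape.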

sadly, my front-ends largely fork on this matter (some use S-Exps and others XML). however, I do have some (mostly untested) code to convert between the formats (mostly originally intended so I could plug one of my XML-based frontends into some of my S-Exp based compiler logic).


I personally prefer C-family ("curly brace") syntax, but there is little special about this at the compiler or VM level, and so I am not really going to argue about syntax-design issues...


Answers to the aforementioned questions would guide my own implementation strategy. I am also interested in pointers to past work that might be relevant to such a pursuit.


ok.

dunno if any of this will have been interesting or helpful.


Erick

[1] http://piumarta.com/papers/albert.pdf
[2] http://piumarta.com/svn2/idst/trunk
[3] http://scholar.google.com/scholar?cites=8937759201081236471&as_sdt=2005&sciodt=1,5&hl=en
[4] http://vpri.org/mailman/private/fonc/2010/001352.html


_______________________________________________
fonc mailing list
fonc@vpri.org
http://vpri.org/mailman/listinfo/fonc


