On 9/28/2011 7:43 PM, Erick Lavoie wrote:
I've found the vision of a simple, open and evolutionary adaptable
programming language substrate, as described in Albert [1],
tantalizing. I especially like the idea of dynamically evolving a
language 'from within' a fluid substrate. I am left wondering the
extent to which this vision was realized and its actual benefits.
The commit log of idst [2] shows that work has been done up until 2009
(minus the minor commit in 2010). The 2009 and 2010 NSF reports make
no mention of the COLA system, the latter report mentions Nothing as
the target for all the other DSLs. No comment has been made on the
language(s) used to explore the VM's design. The git mirrors for idst
and ocean are not accessible on Michael FIG's website anymore. The
latest papers citing [1] date from 2009 [3]. Based on Alan's comment
that "it is not an incremental project [...]. We actually throw away
much of our code and rewrite with new designs pretty often" [4], I
would assume that there is no interest in pursuing the COLA project
anymore.
I can't say myself; I haven't really looked much into it...
I personally don't believe much in "throwing away code and starting
clean", but I am more of the typical "throw together a mountain of code
and try to beat it into working" type of person.
the funny thing is that sometimes work makes code smaller, and sometimes bigger.
a while ago, I was getting negative kloc (I was trying to better
integrate a lot of my code, and rewriting and factoring out a lot of
stuff in relevant areas, ...).
however, this trend seems to have reversed again and now I am back into
positive-kloc land.
<snip>
nothing I can comment on.
probably better left for people who have worked on this project.
In the present, I am interested in developing a COLA-like system that
could serve as an exploration vehicle for research on high-performance
meta-circular VMs for dynamic languages. The main objective would be
to drastically reduce the amount of effort needed to test new ideas
about object representations, calling conventions, memory management,
compiler optimizations, etc. and pave the way for more dynamic
optimizations.
IMO, a good way to approach this would be the development of general
purpose modular components.
personally, I prefer to focus more on representation than on
implementation in these regards:
one can have concrete and reasonably "standardized" representations for
various sorts of data.
this way, one can plug together/replace most parts which deal with these
representations.
for example, one can have an assembler:
on x86/x86-64, it accepts, say, NASM style syntax;
on ARM/Thumb, it accepts ARM-style ASM syntax.
then one can be like: can your code produce ASM? well, now you can use
the assembler.
(there is a "binary interface", which is theoretically a bit faster, but
mostly I use the textual interface as it is IMO nicer and is generally
"plenty fast enough").
also good would be a nice "general purpose native codegen", but sadly
this is a little more complex. there is not really any good IR that
exists between the IL and the ASM (probably this is sort of the whole
point of an IL/IR...). also my old codegen is a bit complex/nasty, and I
have yet to write a new one to replace it.
other projects have gone the other direction, focusing mostly on
implementation. for example, LLVM: nearly everything is built on OO and
classes. want to build something? subclass and override the relevant classes.
this is a strategy, but I don't agree with its use in the absence of a
solid set of canonical data representations. an interface can be nice,
but IMO shouldn't replace the data.
either way, supporting the ability to plug-in components or interfaces
makes sense.
an example: want to add new operations or forms to the IR? well, plug in
the relevant handlers.
my assembler, linker, GC, ... also allow plugging in interfaces
(generally via vtable structs), hence one can add new pseudo-ops to the
assembler or "pseudo-macros" to the assembler's preprocessor, or add new
linker handlers for resolving symbols.
there is actually a system partly built around this: essentially there
are pseudo-ops which call back into special code-generation handlers, at
link-time, to allow "extending" the basic ASM syntax. this was used, for
example, to provide an ASM-level interface to the OO facilities.
in the GC, it is possible to register handlers for implementing new
object-specific behaviors, or extending the GC to cover user-managed
memory areas (to allow tracing through them, getting dynamic type names,
...).
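to make the plug-in-handlers pattern concrete, here is a rough sketch (in Python for brevity; the real system is C with vtable structs, and all names here, like `register_pseudo_op` and the `resb` handler, are made up for illustration):

```python
# Hypothetical sketch: a component keeps a table of handlers (the moral
# equivalent of a vtable struct in C), and new pseudo-ops are added by
# registering entries at runtime rather than by modifying the component.

class Assembler:
    def __init__(self):
        self.pseudo_ops = {}  # name -> handler(args) -> bytes

    def register_pseudo_op(self, name, handler):
        self.pseudo_ops[name] = handler

    def assemble_line(self, line):
        op, *args = line.split()
        if op in self.pseudo_ops:   # extension point: user-supplied ops
            return self.pseudo_ops[op](args)
        raise ValueError("unknown op: " + op)

asm = Assembler()
# plug in a made-up pseudo-op that emits N zero bytes
asm.register_pseudo_op("resb", lambda args: b"\x00" * int(args[0]))
print(asm.assemble_line("resb 4"))   # b'\x00\x00\x00\x00'
```

the same registration shape works for the GC case: handlers keyed by object type instead of by op name.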
also, there is a dynamic type-system built largely around user-defined
types, and where code creating its own types on-the-fly is commonplace
(this is in contrast to more traditional "tagged reference and
magic-number" systems, which although potentially more efficient, are a
bit more painful to extend). again, many operations are handled via
user-defined vtables.
actually, there are a few places internally where things have been
optimized via "plug logic", where essentially the actual general-purpose
logic for a dynamic operation would be fairly slow/costly, and so the
code "cheats" by using cached function-pointers (this is used, for
example, for object vtables, which essentially store both the method
handle, and a cached function-pointer to the handler for calling said
method).
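a rough sketch of the cached-function-pointer idea (Python for illustration; the handle-vs-cached-pointer split is simplified, and all names are invented):

```python
# Hypothetical sketch of the "plug logic" cheat: the general dispatch path
# would work out how to call a method handle every time; instead, the
# vtable slot also caches a direct, ready-to-call pointer built once, so
# the common-case call skips the slow general-purpose logic.

def build_caller(handle):
    # stand-in for the costly general-purpose logic (signature checks,
    # thunk generation, etc.) that produces a direct call path
    def caller(obj, *args):
        return handle(obj, *args)
    return caller

def make_vtable(methods):
    # each slot stores both the method handle and a cached call pointer
    return {name: {"handle": fn, "cached": build_caller(fn)}
            for name, fn in methods.items()}

obj = {"x": 3, "vtable": make_vtable({"double": lambda self: self["x"] * 2})}

slot = obj["vtable"]["double"]   # fast path: one lookup, then direct calls
print(slot["cached"](obj))       # 6
```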
another nifty hack (AFAICT fairly unusual among dynamic type-systems)
is support for pass-by-value types (it does require use of special
reference-handling functions though).
this is supported by the object-system to allow "value classes" (and
faking of RAII semantics).
oh yeah, and one can "optimize" dynamically-typed script code via
type-inference (essentially making the "variant" type potentially also
sort of like an "auto" type).
FWIW, it all also focuses a lot on C interfacing (facilitated by using
tools to mine metadata databases from C code and headers), allowing for
semi-transparent C interfacing, as well as some amount of C-level
reflection (what is the offset and type of 'this' field within a struct?
what is the signature for 'this' function? call 'this' C function
pointer with a dynamically typed list or a string as an arguments list,
...).
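the struct-reflection part ("what is the offset and type of this field?") might look vaguely like this (a hypothetical Python sketch over a made-up metadata table; real C layout rules involve padding and alignment, which are ignored here):

```python
import struct

# Hypothetical sketch of C-level reflection over mined metadata: a table
# describes a C struct's fields, and offsets/sizes are computed from the
# C type names via Python's struct module (simplified: fields are assumed
# to be packed with no padding, purely for illustration).

C_TYPES = {"int": "i", "float": "f", "double": "d", "char*": "P"}

def field_info(fields, name):
    """Return (offset, ctype) of a named field, assuming no padding."""
    offset = 0
    for fname, ctype in fields:
        if fname == name:
            return offset, ctype
        offset += struct.calcsize(C_TYPES[ctype])
    raise KeyError(name)

# metadata for a made-up struct: struct Point { int x; int y; double w; };
point_fields = [("x", "int"), ("y", "int"), ("w", "double")]
print(field_info(point_fields, "w"))   # (8, 'double')
```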
although it would be nice to offer similar capabilities for C++-level
code, this is non-trivial at the moment (C++ is a bit more complex than
C at both the syntax and ABI levels).
however, my VM is, sadly, far from clean or simple...
it is likely that people wanting to work on or extend it (or build
custom languages) would have a bit of a steep learning curve and a lot
of work to do (especially given its near complete lack of adequate
documentation, as most of the "documentation" is the source itself...).
I would like the system to:
1. allow interactive and safe development of multiple natively-running
object representations and dynamic compilers with full tool support
(debugger, profiler, inspector)
2. easily migrate the system to a different implementation language
these are likely non-trivial goals.
any general-purpose solution is likely to be huge and complicated, and
thus, by extension, not terribly amenable to being ported (try rewriting a
non-trivial amount of C code into Java or similar, and feel the pain...).
one can build a custom HLL and proceed to write everything in it, but
then one is stuck with that HLL (even if the underlying target moves
from C/native to Java/JVM or similar), since, by definition, the HLL is
then the implementation language.
in my case, I am mostly writing everything in C, and as for if or when
one could port away from C?... who knows.
however, with sufficient modularization and defined data
representations, it is possible one can have several implementations in
several languages.
a BGBScript implementation in BGBScript would be amusing, but IMO not a
terribly practical goal, especially since BS is still a moving target,
since it has gone from being JS-like a year ago to being more AS3-like
(package/class/property/... syntax), and with some newer features more
reminiscent of C++ (value-classes and copy-constructors, also
pointers/pointer-operations/..., for example).
meanwhile, all with a lack of adequate testing and overall poor
performance (much of its core is still largely dynamically-typed).
but, after all, I intended the language for high-level scripting, not as
a serious implementation language.
so, it matters much more that I can quickly "eval" stuff and load code
from source files than that I have solid reliability or performance, or
the ability to implement it in itself.
The first property would serve to bring the benefits of a live
programming environment to VM development (as possible in Squeak,
through simulation) and the second would serve a) to facilitate the
dissemination of the implementation techniques (including
meta-circular VM construction) in existing language communities
(Python, JavaScript, Scheme, etc.) and b) obtain expressive and
performant notation(s) for VM research without having to start from
scratch each time.
would be nice.
current thinking (levels of abstraction):
ASM for ASM-level;
register-machine model IR (optional, maybe not SSA-form as this is IMO
overly painful, potentially CPS-form is an option as well);
abstract stack-machine IL (stack-machines are easy to work with/target,
note: this has almost no real relation to the CPU/machine stack, apart
from the word 'stack', and is likely targeted via "bytecode");
S-Expressions or XML based AST syntax (internally, lists or DOM-nodes);
...
a person can then "plug in" to whatever levels they are interested in
(invoking parser, then working with the resultant ASTs, feeding custom
ASTs into the VM, ...).
this may seem like a lot, but sadly this level of complexity and
layering may well be needed for sake of generality.
not that it necessarily requires a multi-Mloc compiler project though (I
am doing ok at around 400 kloc for all this, for now, which seems to be
"fairly average" for JavaScript-style VMs).
so, why a register-based IR?
this is because generally the most "workable" way to target a
frames-based ABI is by treating the stack-frame as a collection of
locals and/or temporary variables;
also, some calling conventions (such as SysV/AMD64) want lots of stuff
passed in CPU registers, which essentially makes a big ugly mess for
direct (physical) stack usage (generally, a stack based IL may "fold"
its stack operations into operations over temporaries).
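to illustrate the "folding" idea (Python sketch with made-up opcodes; not the actual IL or IR formats):

```python
# Hypothetical sketch: translate a stack-based IL into three-address
# register-IR ops by simulating the operand stack at compile time.
# Each pushed value becomes a temporary (T0, T1, ...), and each binary
# op pops two temporaries and allocates a fresh destination temporary,
# so no physical stack traffic survives into the IR.

def fold_to_ir(il):
    stack, ir, ntemp = [], [], 0
    for ins in il:
        if ins[0] == "push":            # ("push", var_or_const)
            t = f"T{ntemp}"; ntemp += 1
            ir.append(("mov", t, ins[1]))
            stack.append(t)
        else:                           # ("add",), ("mul",), ...
            b, a = stack.pop(), stack.pop()
            t = f"T{ntemp}"; ntemp += 1
            ir.append((ins[0], t, a, b))
            stack.append(t)
    return ir

# (x + y) * 2 as stack IL:
il = [("push", "x"), ("push", "y"), ("add",), ("push", 2), ("mul",)]
for op in fold_to_ir(il):
    print(op)
```

the resulting temporaries can then be mapped onto stack-frame slots or CPU registers as the calling convention demands.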
why not SSA form? because IMO, SSA form is very painful to produce, can
be generated internally if-needed, and its utility is mostly limited to
micro-optimizing.
CPS is also nifty, but mixes poorly with typical
ABIs/calling-conventions (SysV/AMD64, cdecl, and stdcall, for
example...), and so would necessarily require its own ABI (and thunking
for calls to/from C-land). so, doesn't really seem worthwhile.
other possibilities include threaded-stacks and split-stacks (Go uses
split stacks), which are potentially more "friendly" to standard code
(in a split stack, one allocates small stack-frames, and "jumps" to a
new stack segment if their call-frame will not fit within the existing
stack segment). however, for calls into C-land, this may mean having to
ensure a C-friendly stack-segment (potentially costly). a downside: this
still does not allow for a good/easy way to implement full continuations.
why then a stack-based IL?
because directly targeting a register-based IR is far more painful than
it needs to be, and also because, IMO, a register-machine IR is not the
most appropriate level of abstraction for an IL.
essentially, a stack-based IL can gloss over the nasty internals of a
register IR in much the same way as a high-level syntax glosses over
AST-level nastiness (and potentially allow the IR to better
express/exploit target-specific properties).
also, the IL->IR stage provides a good place to put one's
type-inference/specialization logic, where at the IL level, one simply
"adds two items", and at the IR level, that is when one knows they are
adding, say, two integers from T9 and T13 and putting the result in T17.
keeping types abstract at the IL level also greatly simplifies targeting
it, as then the language front end need not worry as much about the type
of every value (the back-end will work this out).
also: one can directly interpret the stack-based IL as well (treating it
like a dynamically-typed bytecode).
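a direct interpreter for a stack IL of this sort might look vaguely like this (Python sketch, made-up opcodes; the point is that the dynamically-typed "add" needs no type information up front):

```python
# Hypothetical sketch: interpret a stack-based IL directly as a
# dynamically-typed bytecode, without lowering to a register IR.

def interp(il, env):
    stack = []
    for ins in il:
        if ins[0] == "push":
            v = ins[1]
            # strings name variables (looked up in env); others are constants
            stack.append(env[v] if isinstance(v, str) else v)
        elif ins[0] == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)   # dynamic: works for ints, floats, strings
        elif ins[0] == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack[-1]

print(interp([("push", "x"), ("push", "y"), ("add",)], {"x": 3, "y": 4}))  # 7
```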
why not specialize types early (in the parser or language frontend)?
IMO, this does bad things to both flexibility and language semantics.
there may be many things which are very relevant to the backend, which
may not be known "effectively" in the frontend without it becoming
overly specialized.
at the language level, it would become a tradeoff of:
well, one can either have the types in direct visibility (all
declarations visible with mandatory types), or declare the variable as
an abstract/dynamic type, and have it be necessarily slower.
if one leaves the types to the backend, it may be able to exploit things
like indirect visibility, and also infer dynamically-typed variables
which are only ever used with a single type.
this allows the front-end language to be a little "softer", and also
express constructions or features which would not otherwise be
reasonable/possible with a "direct visibility only" model of static
typing (such as in Java, which, by the design of the language and JVM
can't, for example, just directly "import" C code and leave it to the VM
to figure out how to do all the glue).
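a minimal sketch of that sort of backend inference (Python, over a hypothetical IR; only constant assignments are considered, which is far cruder than a real implementation):

```python
# Hypothetical sketch: scan the IR and mark any temporary that is only
# ever assigned values of a single type; those can be specialized to
# that type, while the rest stay dynamically typed ("variant").

def infer_types(ir):
    seen = {}
    for op in ir:
        if op[0] == "mov":               # ("mov", dst, const)
            _, dst, val = op
            seen.setdefault(dst, set()).add(type(val).__name__)
    return {v: (ts.pop() if len(ts) == 1 else "variant")
            for v, ts in seen.items()}

ir = [("mov", "a", 1), ("mov", "a", 2),      # always int -> specialize
      ("mov", "b", 1), ("mov", "b", "hi")]   # mixed -> stays variant
print(infer_types(ir))   # {'a': 'int', 'b': 'variant'}
```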
why S-Expressions or XML for ASTs?...
because IMO, this "just makes sense".
(vs, say, objects or structs, which IMO just have too many more
drawbacks than merits here, unless one is really worried about
micro-optimizing their compiler frontend or something...).
S-Exps vs XML is itself its own set of tradeoffs, and neither seems
"clearly better".
S-Exps are easy to work with and allow compact code; however, they
suffer more from "flexibility" issues: structural changes to
S-Exp-related code far more often require going and altering other code,
or creating redundant special-forms and lots of special-cases just to
deal with minor variations.
XML, being more of a tagging structure (and having key/value
attributes), works a bit nicer here; however, it is somewhat more
awkward to work with, and tends to eat through a lot more memory (a DOM
node may well be a good deal more expensive than a cons cell).
sadly, my front-ends largely fork on this matter (some use S-Exps and
others XML).
however, I do have some (mostly untested) code to convert between the
formats (mostly originally intended so I could plug one of my XML-based
frontends into some of my S-Exp based compiler logic).
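such a conversion might be sketched as follows (Python; a dict stands in for a DOM node, and the tag/attribute scheme here is entirely made up, not the actual formats):

```python
# Hypothetical sketch: convert between an S-Exp style AST (nested lists,
# head symbol as the form name) and an XML-ish node structure (tag,
# key/value attributes, children), in both directions.

def sexp_to_node(sx):
    if not isinstance(sx, list):   # atom: wrap as a leaf node
        return {"tag": "const", "attrs": {"value": sx}, "children": []}
    return {"tag": sx[0], "attrs": {},
            "children": [sexp_to_node(c) for c in sx[1:]]}

def node_to_sexp(node):
    if node["tag"] == "const":
        return node["attrs"]["value"]
    return [node["tag"]] + [node_to_sexp(c) for c in node["children"]]

ast = ["add", ["mul", 2, "x"], 1]
assert node_to_sexp(sexp_to_node(ast)) == ast   # round-trips cleanly
print(sexp_to_node(ast)["tag"])                 # add
```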
I personally prefer C-family ("curly brace") syntax, but there is little
special about this at the compiler or VM level, and so I am not really
going to argue about syntax-design issues...
Answers to the aforementioned questions would guide my own
implementation strategy. I am also interested in pointers to past
work that might be relevant to such a pursuit.
ok.
dunno if any of this will have been interesting or helpful.
Erick
[1] http://piumarta.com/papers/albert.pdf
[2] http://piumarta.com/svn2/idst/trunk
[3]
http://scholar.google.com/scholar?cites=8937759201081236471&as_sdt=2005&sciodt=1,5&hl=en
[4] http://vpri.org/mailman/private/fonc/2010/001352.html
_______________________________________________
fonc mailing list
fonc@vpri.org
http://vpri.org/mailman/listinfo/fonc