Re: Transferring control between code segments, eval, and suchlike things

2003-01-24 Thread Nicholas Clark
On Thu, Jan 23, 2003 at 12:11:20AM -0500, Dan Sugalski wrote:

> Every sub doesn't have to fit in a single segment, though. There may 
> well be a half-zillion subs in any one segment. (Though one segment 
> per sub does give us some interesting possibilities for GCing unused 
> code)

For an interpreter that is allowing eval (or a namespace that isn't locked
against eval) I think that you could only GC the old definition of redefined
subroutines, and any anonymous subroutines that become unreferenced.
Anything else is the potential lucky destination of a random future eval.
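
A minimal sketch of that rule, in Python rather than anything Parrot
actually ships. It assumes one segment per sub; the names
collectable_segments, symbol_table and live_anon_refs are hypothetical.
Only segments that no name and no live reference can reach are
candidates for collection:

```python
def collectable_segments(all_segments, symbol_table, live_anon_refs):
    """Return the segment ids that no name and no live reference can reach."""
    reachable = set(symbol_table.values()) | set(live_anon_refs)
    return set(all_segments) - reachable


# Two named subs plus one anonymous sub that nothing references any more.
symbol_table = {"foo": "seg1", "bar": "seg2"}
all_segments = {"seg1", "seg2", "seg_anon"}

# An eval redefines bar: the name now points at a new segment.
symbol_table["bar"] = "seg3"
all_segments.add("seg3")

garbage = collectable_segments(all_segments, symbol_table, live_anon_refs=set())
# garbage holds the old bar (seg2) and the unreferenced anonymous sub;
# foo stays, because a future eval may still call it by name.
```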

Nicholas Clark



Re: Transferring control between code segments, eval, and suchlike things

2003-01-23 Thread Juergen Boemmels
Dan Sugalski <[EMAIL PROTECTED]> writes:

> Okay, since this has all come up, here's the scoop from a design perspective.
> 
> First, the branch opcodes (branch, bsr, and the conditionals) are all
> meant for movement within a segment of bytecode. They are *not*
> supposed to leave a segment. To do so was arguably a bad idea, now
> it's officially an error. If you need to do so, branch to an op that
> can transfer across boundaries.
> 
> 
> Design Edict #1: Branches, which is any transfer of control that takes
> an offset, may *not* escape the current bytecode segment.

Okay with that.

> Next, jumps. Jumps take absolute addresses, so either need fixup at
> load time (blech), are only valid in dynamically generated code (okay,
> but limiting), or can only jump to values in registers (that's
> fine). Jumps aren't a problem in general.
> 
> 
> Design Edict #2: Jumps may go anywhere.

In the sense that every possible target (via #3) can be reached with a
jump, but bad things may happen if the target isn't valid.

> Destinations. These are a pain, since if we can go anywhere then the
> JIT has to do all sorts of nasty and unpleasant things to compensate,
> and to make every op a valid destination. Yuck.
> 
> 
> Design Edict #3: All destinations *must* be marked as such in the
> bytecode metadata segment. (I am officially nervous about this, as I
> can see a number of ways to subvert this for evil)

This is no more or less evil than 
branch -1
The destinations can be range-checked at load time, the assembler will
hopefully emit these offsets correctly, and they will be read-only after
compilation.
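
A rough sketch of such a load-time range check, written in Python for
brevity. The function name check_destinations and the representation of
the metadata as a plain list of op offsets are assumptions, not Parrot's
actual bytecode format:

```python
def check_destinations(segment_length, destinations):
    """Reject any marked destination that falls outside the segment."""
    for offset in destinations:
        if not 0 <= offset < segment_length:
            raise ValueError(
                f"destination {offset} is outside a segment of "
                f"{segment_length} ops"
            )
    # Freeze the table so it is read-only after loading.
    return frozenset(destinations)


valid = check_destinations(10, [0, 4, 9])
```

Passing an offset such as -1 or 10 with a segment length of 10 raises
ValueError instead of producing a table.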

> I'm only keeping jumps (and their corresponding jsr) around for
> nostalgic reasons, and with the vague hope they may be useful. I'm not
> sure about this.
> 
> 
> Design Edict #4: Dan is officially iffy on jumps, but can see them as
> useful for lower-level statically bound languages such as forth,
> Scheme, or C.
> 
> 
> That leads us to
> 
> Design Edict #5: Dan will accommodate semantics for languages outside
> the core set (perl, python, ruby) only if they don't compromise
> performance for the core set.
> 
> 
> Calling actual routines--subs, methods, functions, whatever--at the
> high level isn't done with branches or jumps. It is, instead, done
> with the call series of ops. (call, callmeth, callcc, tailcall,
> tailcallmeth, tailcallcc (though that one makes my head hurt), invoke)
> These are specifically for calling code that's potentially in other
> segments, and to call into them at fixed points. I think these need to
> be hashed out a bit to make them more JIT-friendly, but they're the
> primary transfer destination point

These calls are always jumps or jsr in disguise. In the end they
always do a goto ADDRESS(something). This means that every
sub/method/continuation must be marked per #3.

> Design Edict #6: The first op in a sub is always a valid
> jump/branch/control transfer destination

This is essentially #3.

> Now. Eval. The compile opcode going in is phenomenally cool (thanks,
> Leo!) but has pointed out some holes in the semantics. I got handwavey
> and, well, it shows. No cookie for me.
> 
> 
> The compreg op should compile the passed code in the language that is
> indicated and should load that bytecode into the current
> interpreter. That means that if there are any symbols that get
> installed because someone's defined a sub then, well, they should get
> installed into the interpreter's symbol tables.

The compile should not install the symbols in the interpreter's symbol
table; it should store them somewhere in the bytecode metadata. The eval
should then install them in the interpreter's symbol table.

The problem really starts if BEGIN {...} blocks are used, because they
will be evaluated after the block is compiled but before the whole
compile is finished.

> Compiled code is an interesting thing. In some cases it should return
> a sub PMC, in some cases it should execute and return a value, and in
> some cases  it should install a bunch of stuff in a symbol table and
> then return a value. These correspond to:
> 
> 
> 
> eval "print 12";
> 
> $foo = eval "sub bar{return 1;}";
> 
> require foo.pm;
> 
> respectively. It's sort of a mixed bag, and unfortunately we can't
> count on the code doing the compilation to properly handle the
> semantics of the language being compiled. So...
> 
> 
> Design Edict #7: the compreg opcode will execute the compiled code,
> calling in with parrot's calling conventions. If it should return
> something, then it had darned well better build it and return it.

I find it better to keep compile and eval separate.
The compile opcode should simply return a bytecode PMC, which can then
be invoked some time later.

> Oh, and:
> 
> Design Edict #8: compreg is prototyped. It takes a single string and
> must return a single PMC. The compiler may cheat as need be. (No need
> to check and see if it returned a string, or an int)

It should return a bytecode segment.
 
> Yes, this

Re: Transferring control between code segments, eval, and suchlike things

2003-01-23 Thread Jason Gloudon
On Wed, Jan 22, 2003 at 03:00:37PM -0500, Dan Sugalski wrote:

> Destinations. These are a pain, since if we can go anywhere then the 
> JIT has to do all sorts of nasty and unpleasant things to compensate, 
> and to make every op a valid destination. Yuck.

Arbitrary jumps are not that difficult to deal with in the JIT.  The JIT
compiler can handle jumps to arbitrary addresses by falling back into the
interpreter if the destination does not coincide with a previously known entry
point, reentering the JIT code later at a safe point. pbc2c generated code does
this. This way the JIT does not have to support making every instruction a safe
branch destination.
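
The fallback can be illustrated with a toy run loop in Python. This is
only a sketch of the idea, not how the Parrot JIT or pbc2c is written;
the op names, the run function and the "jit"/"interp" labels are
invented for the illustration:

```python
def run(code, entry_points, start):
    """Toy run loop.

    code is a list of (op, arg) pairs; ops are "nop", "jump" (absolute
    target) and "end". Compiled code may only be entered at entry_points.
    Returns a trace of (mode, pc) pairs.
    """
    pc, mode, trace = start, "interp", []
    while True:
        if pc in entry_points:
            mode = "jit"  # safe point: (re)enter the compiled code
        op, arg = code[pc]
        trace.append((mode, pc))
        if op == "end":
            return trace
        if op == "jump":
            pc = arg
            if pc not in entry_points:
                mode = "interp"  # unknown target: fall back to the interpreter
        else:
            pc += 1


code = [("nop", None), ("jump", 3), ("nop", None),
        ("nop", None), ("nop", None), ("end", None)]
trace = run(code, entry_points={0, 4}, start=0)
# trace: jit at 0 and 1, interp at 3 (not a known entry), jit again at 4 and 5
```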

-- 
Jason



Re: Transferring control between code segments, eval, and suchlike things

2003-01-23 Thread Leopold Toetsch
Benjamin Stuhl wrote:


At 03:00 PM 1/22/2003 -0500, you wrote:



... Although, all this would seem to suggest that we'd need/want a 
special-purpose allocator for bytecode segments, since every sub has to 
fit within precisely one segment (and I know _I'd_ like to keep bytecode 
segments on their own memory pages, to e.g. maximize sharing on fork()).


IMHO this is a big waste of memory - and running this page aligned code 
JITted doesn't buy anything.


Design Edict #7: the compreg opcode will execute the compiled code, 
calling in with parrot's calling conventions. If it should return 
something, then it had darned well better build it and return it.


How does this play with

eval 'sub bar { change_foo(); } BEGIN { bar(); }  (...stuff that depends 
on foo...)';

? The semantics of BEGIN{} would seem to require that bar be installed 
into the symbol table immediately... but then how do we reproduce that 
if we're e.g. loading precompiled bytecode?


Precompiled PBC and eval is a PITA. This issue seems to imply some extra 
parsing during load time and setting up symbols. I dunno yet, how to 
handle this.

leo



Re: Transferring control between code segments, eval, and suchlike things

2003-01-23 Thread Leopold Toetsch
Dan Sugalski wrote:


Okay, since this has all come up, here's the scoop from a design 
perspective.


Hard stuff did meet my printer at midnight, reading it onscreen twice 
didn't help ;-)

First:

Definition #0: A bytecode segment is a sequence of code, which is loaded 
into memory with no execution of such code interspersed. So all subs, 
modules, whatever loaded from zig files may be one code segment, *if* 
the runloop wasn't entered. Or: as soon as the code is running, loading 
additional bytecode puts this code into a different bytecode segment.


Design Edict #1: Branches, which is any transfer of control that takes 
an offset, may *not* escape the current bytecode segment.


Design Edict #2: Jumps may go anywhere.




Design Edict #3: All destinations *must* be marked as such in the 
bytecode metadata segment. (I am officially nervous about this, as I can 
see a number of ways to subvert this for evil)


I would define: Jumps may go to any location acquired via a set_addr call 
or to branch tables. Jumping somewhere else may kill your dog.

Jumping to a set_addr label is recognized already; jump tables will 
probably need some marker around them, so that the jump targets won't 
get killed by dead code elimination.


I'm only keeping jumps (and their corresponding jsr) around for 
nostalgic reasons, and with the vague hope they may be useful. I'm not 
sure about this.


They would be useful for a computed goto.


s/compreg/compile/g for($below);



The compreg op should compile the passed code ...



Design Edict #7: the compreg opcode will execute the compiled code, 
calling in with parrot's calling conventions. If it should return 
something, then it had darned well better build it and return it.


If the compile opcode has to execute the code, I would call it "eval".

But: When compile and eval are separate stages, the HL might be able to 
pull the compile stage out of e.g. loops. So I think keeping compiling 
and evaling separate makes sense.
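
As an analogy only: Python's built-in compile() and eval() are already
split this way, so the hoisting can be shown directly. This says nothing
about how the Parrot ops would be spelled; the source string and the
variable names are made up for the example:

```python
source = "x * x + 1"

# Compile stage: done once, outside the loop.
code = compile(source, "<hoisted>", "eval")

# Eval stage: run the already-compiled code once per iteration.
results = [eval(code, {"x": x}) for x in range(4)]
# results == [1, 2, 5, 10]
```

If compiling always implied executing, the compile step would have to
stay inside the loop.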


Thanks for putting this together,
leo




Re: Transferring control between code segments, eval, and suchlike things

2003-01-22 Thread Dan Sugalski
At 6:24 PM -0500 1/22/03, Benjamin Stuhl wrote:

At 03:00 PM 1/22/2003 -0500, you wrote:

Okay, since this has all come up, here's the scoop from a design perspective.

First, the branch opcodes (branch, bsr, and the conditionals) are 
all meant for movement within a segment of bytecode. They are *not* 
supposed to leave a segment. To do so was arguably a bad idea, now 
it's officially an error. If you need to do so, branch to an op 
that can transfer across boundaries.

Design Edict #1: Branches, which is any transfer of control that 
takes an offset, may *not* escape the current bytecode segment.

Seems reasonable. Especially when the bytecode loader may not 
guarantee the relative placement of segments (think mmap()). 
Although, all this would seem to suggest that we'd need/want a 
special-purpose allocator for bytecode segments, since every sub has 
to fit within precisely one segment (and I know _I'd_ like to keep 
bytecode segments on their own memory pages, to e.g. maximize sharing 
on fork()).

Every sub doesn't have to fit in a single segment, though. There may 
well be a half-zillion subs in any one segment. (Though one segment 
per sub does give us some interesting possibilities for GCing unused 
code)

Next, jumps. Jumps take absolute addresses, so either need fixup at 
load time (blech), are only valid in dynamically generated code 
(okay, but limiting), or can only jump to values in registers 
(that's fine). Jumps aren't a problem in general.

Fixups aren't so bad if we make the jump opcode itself take an index 
into a table of fixups (thus letting the bytecode stream stay 
read-only). Register jumps are dangerous, since parrot can't control 
what the user code loads into the register (while we can 
theoretically protect the fixup table from anything short of native 
code).

Indirection. Ick. :)

Though, on the other hand, a jump with an integer constant 
destination is pretty pointless, so we could consider using that to 
index into a jump table. OTOH, it'd be the only thing using the jump 
table, so I'm not sure it's worth it. Might speed things up some. 
I'll think on that for a bit.
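
For concreteness, a small Python sketch of the indirection being
discussed. The Segment class, its method names and the base address are
all hypothetical; the point is only that the ops stay untouched while
the table alone is patched at load time:

```python
class Segment:
    def __init__(self, ops, jump_table):
        self.ops = tuple(ops)               # bytecode stream stays read-only
        self.jump_table = list(jump_table)  # only this is patched at load time

    def relocate(self, base):
        """Load-time fixup: make segment-relative targets absolute."""
        self.jump_table = [base + target for target in self.jump_table]

    def resolve(self, index):
        """What a 'jump <integer constant>' op would look up."""
        return self.jump_table[index]


seg = Segment(
    ops=[("jump", 0), ("nop", None), ("end", None)],
    jump_table=[2],  # entry 0 points at op 2 within the segment
)
seg.relocate(base=1000)  # pretend the loader placed the segment at 1000
target = seg.resolve(0)  # 1002
```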

Design Edict #2: Jumps may go anywhere.

Destinations. These are a pain, since if we can go anywhere then 
the JIT has to do all sorts of nasty and unpleasant things to 
compensate, and to make every op a valid destination. Yuck.

Design Edict #3: All destinations *must* be marked as such in the 
bytecode metadata segment. (I am officially nervous about this, as 
I can see a number of ways to subvert this for evil)

Marked destinations are very important; as for evil subversion, how 
about just saying "untrusted code only gets pure interpretation, and 
the untrusting interpreter bounds-checks everything"?

True, and we'll not be JITting safe-mode code, or likely not at least 
because of the resource constraint checking.

[snip]

Calling actual routines--subs, methods, functions, whatever--at the 
high level isn't done with branches or jumps. It is, instead, done 
with the call series of ops. (call, callmeth, callcc, tailcall, 
tailcallmeth, tailcallcc (though that one makes my head hurt), 
invoke) These are specifically for calling code that's potentially 
in other segments, and to call into them at fixed points. I think 
these need to be hashed out a bit to make them more JIT-friendly, 
but they're the primary transfer destination point

Design Edict #6: The first op in a sub is always a valid 
jump/branch/control transfer destination

Wouldn't make much sense if you had a sub but couldn't call it, now 
would it? :-D

Don't tempt the JAPHers!


Now. Eval. The compile opcode going in is phenomenally cool 
(thanks, Leo!) but has pointed out some holes in the semantics. I 
got handwavey and, well, it shows. No cookie for me.

The compreg op should compile the passed code in the language that 
is indicated and should load that bytecode into the current 
interpreter. That means that if there are any symbols that get 
installed because someone's defined a sub then, well, they should 
get installed into the interpreter's symbol tables.

Compiled code is an interesting thing. In some cases it should 
return a sub PMC, in some cases it should execute and return a 
value, and in some cases  it should install a bunch of stuff in a 
symbol table and then return a value. These correspond to:


   eval "print 12";

   $foo = eval "sub bar{return 1;}";

   require foo.pm;

respectively. It's sort of a mixed bag, and unfortunately we can't 
count on the code doing the compilation to properly handle the 
semantics of the language being compiled. So...

Design Edict #7: the compreg opcode will execute the compiled code, 
calling in with parrot's calling conventions. If it should return 
something, then it had darned well better build it and return it.

How does this play with

eval 'sub bar { change_foo(); } BEGIN { bar(); }  (...stuff that 
depends on foo...)';

? The semantics of BEGIN{} would seem to require that bar be 
in

Re: Transferring control between code segments, eval, and suchlike things

2003-01-22 Thread Benjamin Stuhl
At 03:00 PM 1/22/2003 -0500, you wrote:

Okay, since this has all come up, here's the scoop from a design perspective.

First, the branch opcodes (branch, bsr, and the conditionals) are all 
meant for movement within a segment of bytecode. They are *not* supposed 
to leave a segment. To do so was arguably a bad idea, now it's officially 
an error. If you need to do so, branch to an op that can transfer across 
boundaries.

Design Edict #1: Branches, which is any transfer of control that takes an 
offset, may *not* escape the current bytecode segment.

Seems reasonable. Especially when the bytecode loader may not guarantee 
the relative placement of segments (think mmap()). Although, all this 
would seem to suggest that we'd need/want a special-purpose allocator for 
bytecode segments, since every sub has to fit within precisely one segment 
(and I know _I'd_ like to keep bytecode segments on their own memory 
pages, to e.g. maximize sharing on fork()).

Next, jumps. Jumps take absolute addresses, so either need fixup at load 
time (blech), are only valid in dynamically generated code (okay, but 
limiting), or can only jump to values in registers (that's fine). Jumps 
aren't a problem in general.

Fixups aren't so bad if we make the jump opcode itself take an index into a 
table of fixups (thus letting the bytecode stream stay read-only). Register 
jumps are dangerous, since parrot can't control what the user code loads 
into the register (while we can theoretically protect the fixup table from 
anything short of native code).

Design Edict #2: Jumps may go anywhere.

Destinations. These are a pain, since if we can go anywhere then the JIT 
has to do all sorts of nasty and unpleasant things to compensate, and to 
make every op a valid destination. Yuck.

Design Edict #3: All destinations *must* be marked as such in the bytecode 
metadata segment. (I am officially nervous about this, as I can see a 
number of ways to subvert this for evil)

Marked destinations are very important; as for evil subversion, how about 
just saying "untrusted code only gets pure interpretation, and the 
untrusting interpreter bounds-checks everything"?

[snip]
Calling actual routines--subs, methods, functions, whatever--at the high 
level isn't done with branches or jumps. It is, instead, done with the 
call series of ops. (call, callmeth, callcc, tailcall, tailcallmeth, 
tailcallcc (though that one makes my head hurt), invoke) These are 
specifically for calling code that's potentially in other segments, and to 
call into them at fixed points. I think these need to be hashed out a bit 
to make them more JIT-friendly, but they're the primary transfer 
destination point

Design Edict #6: The first op in a sub is always a valid 
jump/branch/control transfer destination

Wouldn't make much sense if you had a sub but couldn't call it, now would 
it? :-D

Now. Eval. The compile opcode going in is phenomenally cool (thanks, Leo!) 
but has pointed out some holes in the semantics. I got handwavey and, 
well, it shows. No cookie for me.

The compreg op should compile the passed code in the language that is 
indicated and should load that bytecode into the current interpreter. That 
means that if there are any symbols that get installed because someone's 
defined a sub then, well, they should get installed into the interpreter's 
symbol tables.

Compiled code is an interesting thing. In some cases it should return a 
sub PMC, in some cases it should execute and return a value, and in some 
cases  it should install a bunch of stuff in a symbol table and then 
return a value. These correspond to:


   eval "print 12";

   $foo = eval "sub bar{return 1;}";

   require foo.pm;

respectively. It's sort of a mixed bag, and unfortunately we can't count 
on the code doing the compilation to properly handle the semantics of the 
language being compiled. So...

Design Edict #7: the compreg opcode will execute the compiled code, 
calling in with parrot's calling conventions. If it should return 
something, then it had darned well better build it and return it.

How does this play with

eval 'sub bar { change_foo(); } BEGIN { bar(); }  (...stuff that depends on 
foo...)';

? The semantics of BEGIN{} would seem to require that bar be installed into 
the symbol table immediately... but then how do we reproduce that if we're 
e.g. loading precompiled bytecode?

Oh, and:

Design Edict #8: compreg is prototyped. It takes a single string and must 
return a single PMC. The compiler may cheat as need be. (No need to check 
and see if it returned a string, or an int)

Yes, this does mean that for plain assembly that we want to compile and 
return a sub ref for we need to do extra in the assembly we pass in. 
Tough, we can deal. If it was dead-simple it wouldn't be assembly. :)

That makes sense.

-- BKS




Transferring control between code segments, eval, and suchlike things

2003-01-22 Thread Dan Sugalski
Okay, since this has all come up, here's the scoop from a design perspective.

First, the branch opcodes (branch, bsr, and the conditionals) are all 
meant for movement within a segment of bytecode. They are *not* 
supposed to leave a segment. To do so was arguably a bad idea, now 
it's officially an error. If you need to do so, branch to an op that 
can transfer across boundaries.

Design Edict #1: Branches, which is any transfer of control that 
takes an offset, may *not* escape the current bytecode segment.
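
A sketch of how this edict could be checked, in Python for brevity. It
assumes offsets are counted in ops relative to the branching op, and
the names BRANCH_OPS, branch_escapes and verify_branches are invented
here, not part of Parrot:

```python
BRANCH_OPS = {"branch", "bsr"}


def branch_escapes(pc, offset, segment_length):
    """True if a branch at pc with this offset lands outside the segment."""
    target = pc + offset
    return not 0 <= target < segment_length


def verify_branches(ops):
    """ops is a list of (name, operand). Return the pcs of escaping branches."""
    return [
        pc
        for pc, (name, operand) in enumerate(ops)
        if name in BRANCH_OPS and branch_escapes(pc, operand, len(ops))
    ]


ops = [("branch", 2), ("noop", 0), ("branch", -3), ("bsr", 5)]
bad = verify_branches(ops)
# bad == [2, 3]: op 2 would land at -1, op 3 at 8 in a 4-op segment
```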

Next, jumps. Jumps take absolute addresses, so either need fixup at 
load time (blech), are only valid in dynamically generated code 
(okay, but limiting), or can only jump to values in registers (that's 
fine). Jumps aren't a problem in general.

Design Edict #2: Jumps may go anywhere.

Destinations. These are a pain, since if we can go anywhere then the 
JIT has to do all sorts of nasty and unpleasant things to compensate, 
and to make every op a valid destination. Yuck.

Design Edict #3: All destinations *must* be marked as such in the 
bytecode metadata segment. (I am officially nervous about this, as I 
can see a number of ways to subvert this for evil)

I'm only keeping jumps (and their corresponding jsr) around for 
nostalgic reasons, and with the vague hope they may be useful. I'm 
not sure about this.

Design Edict #4: Dan is officially iffy on jumps, but can see them as 
useful for lower-level statically bound languages such as forth, 
Scheme, or C.

That leads us to

Design Edict #5: Dan will accommodate semantics for languages outside 
the core set (perl, python, ruby) only if they don't compromise 
performance for the core set.

Calling actual routines--subs, methods, functions, whatever--at the 
high level isn't done with branches or jumps. It is, instead, done 
with the call series of ops. (call, callmeth, callcc, tailcall, 
tailcallmeth, tailcallcc (though that one makes my head hurt), 
invoke) These are specifically for calling code that's potentially in 
other segments, and to call into them at fixed points. I think these 
need to be hashed out a bit to make them more JIT-friendly, but 
they're the primary transfer destination point

Design Edict #6: The first op in a sub is always a valid 
jump/branch/control transfer destination

Now. Eval. The compile opcode going in is phenomenally cool (thanks, 
Leo!) but has pointed out some holes in the semantics. I got 
handwavey and, well, it shows. No cookie for me.

The compreg op should compile the passed code in the language that is 
indicated and should load that bytecode into the current interpreter. 
That means that if there are any symbols that get installed because 
someone's defined a sub then, well, they should get installed into 
the interpreter's symbol tables.

Compiled code is an interesting thing. In some cases it should return 
a sub PMC, in some cases it should execute and return a value, and in 
some cases  it should install a bunch of stuff in a symbol table and 
then return a value. These correspond to:


   eval "print 12";

   $foo = eval "sub bar{return 1;}";

   require foo.pm;

respectively. It's sort of a mixed bag, and unfortunately we can't 
count on the code doing the compilation to properly handle the 
semantics of the language being compiled. So...

Design Edict #7: the compreg opcode will execute the compiled code, 
calling in with parrot's calling conventions. If it should return 
something, then it had darned well better build it and return it.

Oh, and:

Design Edict #8: compreg is prototyped. It takes a single string and 
must return a single PMC. The compiler may cheat as need be. (No need 
to check and see if it returned a string, or an int)

Yes, this does mean that for plain assembly that we want to compile 
and return a sub ref for we need to do extra in the assembly we pass 
in. Tough, we can deal. If it was dead-simple it wouldn't be 
assembly. :)

I think that's it. Let's have at it and see where the edicts need fixing.
--
Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk