Re: Registers vs. Stack

2003-08-22 Thread Michael Maraist
On Thursday 21 August 2003 21:40, Brent Dax wrote:
> # we're already running with a faster opcode dispatch

Man, I wish I had the time to keep up with Parrot development.  Though, as 
others have pointed out, the core architecture is somewhat solidified by this 
point, I thought I'd put in my two cents' worth.

I completely agree that stack machines are for wimps ;)  But I have a problem 
with some people's representation of stack machines.  When was the last time a 
modern real CPU actually performed push/pop operations for its stack?  That 
entire argument is moot in my opinion.

Look at the SPARC chip as an example.  You have a set of pre-defined, directly 
mappable registers which are appended to the stack; then you have your input 
parameters, your worst-case output parameters, and your local spill 
variables, all of which are pre-calculated at compile time, and a single 
number is computed.  At the entry and exit of each function call, that number 
is added to and subtracted from the stack pointer.  All subsequent "stack 
operations" are simply "ld/st [sp + offset], reg".  If you were ballsy enough, 
you could do global variable allocation, but depending on whether you're 
generating relocatable code, you might still have to add the address to your 
Instruction Pointer.  Thus, short of always having enough registers, you have 
to perform offset calculations, which is not much different from stack 
pushes/pops.  But the paradigm is different.

But there's another issue that I've seen brought up.  By statically allocating 
spill/input/output variables to an offset of the stack pointer, you rid 
yourself of the issue of "where was that variable in the mix of pushes and 
pops?"  You're guaranteed that a variable is at a specific address, albeit a 
relative address.

There is no difference between performing
add R1, 5 # R1 += 5
then
add [SP+1], 5

Especially if at the opcode executing level, R1 is defined as SP+R1_OFFSET

Taking the register-spill analogy back to JITing: we don't know how big the 
CPU register set is at Parrot compile time, so we don't know what a good 
register-set size is.  x86s are sadly still treated as accumulators (even 
with x86-64); there are just too many cool compiler techniques that don't 
work unless you have 32+ GPRs, so it's hardly worth the effort to test for 
possible optimizations with only 8.

On the other hand, IA-64 with 100+ GPRs can unroll loops and map temporaries 
like there's no tomorrow.

The end result is that a dynamically sized register set is probably the ideal 
for a VM.  If the compiler can assume that you have as many registers as you 
need, but is given the constraint of "please try not to use any more than you 
absolutely need" (a la generic Chaitin or Chow (basic-block-based) allocation), 
then in the rare case that an Itanium is in use, a full register mapping can 
occur.  If we need to resort to accumulator-style spilling, then you can 
utilize a raw vmStackFrame + offset, where the vmStackFrame pointer is itself 
a register.  It's also possible (albeit not as obvious) to have a hybrid that 
maps the first n variables to physical registers, for the common case of 
32-register machines.

Now in the case of Parrot, our stack (last I checked) was not homogeneous, so 
this simplistic case wouldn't work well.  But there are two solutions that 
immediately occur to me.  
Soln A)
Treat the datatype as trusted-opaque, and large enough to handle the largest 
data type, e.g.
iadd R1 <= R2, R3
sconcat R4 <= R5, R6
etc.
We merely trust that the compiler won't mix and match data types into offset 
assignments.
We would still, of course, need to properly handle GC/DOD through the stack, so 
we couldn't be completely opaque.

Input parameters to functions would have to either be statically sized, or 
there would have to be a special op-code to access dynamically sized input 
parameters of unknown types.
A simple opcode
regAlloc(numInputRegs, numLocalRegs)
would shift the frame pointer such that the inputs become regs 
1..numInputRegs, and the locals become regs numInputRegs+1 .. 
numInputRegs+numLocalRegs.  This is somewhat similar to the Itanium 
register-allocation style.

Soln B)
Have a multitude of homogeneous stacks.  This is identical to solution A, but 
trades complexity for performance.  Namely, there would be:
intStack
fpStack
strStack
objStack

The reg-allocation op-code would also require 4 pairs of sizes.
Additionally, the compiler must maintain 4 separate input/output/local 
variable-to-register mappings.

The advantages are:
* no typecasting problems with parameters
* GCing is more efficient (guaranteed that all non-null refs found in the 
str/obj stacks need DODing; no need to test the stack-element type on each 
iteration).
* maps more properly onto integer/floating-point register sets.  The str/obj 
stacks need external referencing anyway.


Well, again, just my $0.02.  But I just felt the need to defend "practical" 
stack computing.





Re: Interpreter memory allocation

2002-02-15 Thread Michael Maraist

On Fri, 15 Feb 2002, Alex Gough wrote:

> On Thu, 14 Feb 2002, Dan Sugalski wrote:
>
> > To allocate memory that is GCable, call Parrot_allocate(interpreter,
> > size). Then stash the pointer and size in your buffer header, or
> > it'll go missing later. To resize a chunk of memory, call
> > mem_realloc(interpreter, from, oldsize, newsize).
>
> Having to pass in the oldsize makes it very tricky to use wrappers
> around around mem_realloc which work when code isn't in parrot, is it
> not possible to have the memory pools themselves be a bit more like
> malloc/realloc in tracking allocated sizes (I imagine they need to do
> this anyway, if GC is to free chunks appropriately)?

While this is certainly possible, I believe the intent was to allow
the Parrot core to have as much control over, and access to, the data set as
possible.  Normally you'd have to use gc_mem_sizeof( ptr ) to get the
size, which would encourage caching of the size value (possibly in the
local structure).  By having the size shared by both the user-defined
structure and the GC, you avoid both the function call and the wasted
memory.  This is just speculation on my part though.

>
> > If your buffer header can't be reached from the root set, you'll end
> > up having it reclaimed when a sweep is made.
>
> Which bits of a PMC count as being reachable from the root set?

Everything in any stack (including register stacks), plus PMCs shared
between threads, and finally those memory chunks explicitly registered as
foreign (e.g. a manually administered reference).





Re: Configger this.

2001-12-07 Thread Michael Maraist

On Fri, 7 Dec 2001, Andy Dougherty wrote:

> On Fri, 7 Dec 2001, Bryan C. Warnock wrote:
>
> > On Friday 07 December 2001 08:43 am, Andy Dougherty wrote:
> > > Funny you should mention that, because Perl's Configure does things in
> > > order determined by 'Dependency-ish rules, a la make'.  Configure is
> > > indeed built in just the way you suggest.
>
> > Except, of course, for being one big honking file.
>
> That's a mere implementation detail :-).  (Though one that's admittedly
> quite intimidating!)  It isn't one big file until the very very end step.
> There's no reason it couldn't be a file that simply called a series of
> tiny scripts
>
>   #!/bin/sh
>   . U/intro
>   . U/find-shell
>   . U/cmdline
>   . U/find-hints
>
> etc., except that it would then be even slower than it already is.
> (Probably not a significant effect now, but it would have been when
> Configure was originally designed.)

I was thinking more along the lines of BSD-booting:

#!/bin/sh

for x in `ls U/* | sort`; do
  $x
done

Performance really shouldn't be that big of a deal, though ls/sort and
the whole shell-script nature are rather Unix-centric.  A perl5-based
configure could easily be:

#!perl

opendir DH, "U" or die "can't open U: $!";
my @files = sort { $a cmp $b } grep { /^\d/ } readdir DH;
closedir DH;
system("U/$_") for @files;


The reason for exec'ing, and not sourcing, was to allow local variables
within each module (a la the BSD inet.d files), but that's optional if
performance is really an issue (I don't see why it should be).

Now people just add conf-files and produce their own dependency list.
Since we're comparing strings, not numbers, sub-versions work like the
Library of Congress catalog:

1
2 (dep on 1)
3 (dep on 2)
25 (comes before 3)
26
255 (comes before 26)



Though that might look a bit confusing; we can always use "%02i" notation.


-Michael




Re: [PATCH classes/perlnum.pmc] Use C please, not C++.

2001-11-28 Thread Michael Maraist


While your point is taken, it's hardly considered "C++" anymore.  Many
C compilers have adopted many such useful features.

On Wed, 28 Nov 2001, Andy Dougherty wrote:

> diff -r -u parrot-current/classes/perlnum.pmc parrot-andy/classes/perlnum.pmc

>  void set_integer (PMC* value) {
> -//SELF->vtable = &(Parrot_base_vtables[enum_class_PerlInt]);
> +   /* SELF->vtable = &(Parrot_base_vtables[enum_class_PerlInt]); */
>  SELF->cache.num_val = (FLOATVAL)value->vtable->get_integer(INTERP, value);
>  }

-Michael




mem manager direction

2001-11-26 Thread Michael Maraist

I've done a bunch of reading, and though I'm not finished, I'm starting to
look toward the following overall algorithm, based on the assumptions
specified below.  I'm not _necessarily_ looking for comments at this point,
because I haven't finished evaluating the specifics of several algorithms,
but feel free to comment.

Foreword:
I'm looking at a hybrid of several algorithms, since different regions
have different advantages and disadvantages.  The typical downside of
hybridizing is extra overhead in the inner loop.  Here, the cost is a
greater number of branches in the inner loop.  You can skip to the
posted algorithm pseudo-code for a quick overview.


Since we're inlining the data size, I'm selecting which sub-algorithm to
use based on the size.  The data type (be it a raw character buffer
like a string, or a nested data structure such as an Array) will have
to be known and thus interleaved within the GC for now.  What I'd like to do is
define vtable-ish ops for each data type, and have the perl-make-script
generate the core routine by concatenating the different types.
I don't want the overhead of a vtable function call, but I don't want
to expose the garbage collector to extension writers.  More to come
on this.

Small data types ( size <= 256 bytes ):
Ideally, there's zero header space, due to the relative space cost, and
since we're already using the size this should be possible.  For now, I'm
looking toward a generic copying collector.  Generational schemes can
require additional header space (either for marking or
boundary references).  The only advantage of making this generational would
be to minimize the scan depth (most generational algorithms don't descend
through data structures living in aged heaps), but this requires a
write barrier, and I'm currently opposed to that (it defeats much of the
advantage of abstracted garbage collection over ref-counting).  I'm still
considering this for terminal (non-descending) data types.  I figure small
data types will be the most common by far, so the less work done per
object the better.  Further, since each object will probably be visited on
each reclamation stage, we don't get to take advantage of
anti-swap-thrashing techniques without imposing relatively high
out-of-band overhead.  The associated heap is grown (doubled in size)
whenever less than 33% of the heap is available after a reclamation.
Unfortunately, we don't know that we want to grow the heap until we've
finished reclaiming, and a regrowth involves another copying phase, so
it's deferred until the next collection.  I found that this isn't so bad
if we keep the max data size much less than 33% of the heap.  I'm looking
at 64K as a starting heap size; thus 256B is much less than 21K.

Medium data types ( 256 < size <= page-size ):
Here we can afford header prefixes, while copies are more expensive
(relative to other operations).  Allocations in this region can be pretty
erratic and thus hard to coalesce, so I'm looking at using a generational
scheme.  Unfortunately, many such schemes "age" the wrong data, or require
a lot of extra copying.  Some of the nicer schemes only require comparison
to a water-mark (below which an object is considered aged, and above which
the object is considered young and thus likely to expire soon).  Such
watermark schemes allow adaptive modification of the water-mark(s).  In
other words, when we find that most data is not dying, we want to raise
the water-mark (to avoid unneeded copying).  When we find that a lot of
aged data is dying, we want to lower the water-mark (to alleviate garbage
in the aged region).  I'm currently fiddling with a compacting collector
here.  This requires two stages.  The first stage applies a "version" tag
prefixed to each object (a 4-byte integer, where possible).  On each
execution of gc_reclaim, the version number is incremented.  In the
Dead-Object-Detection stage (which, in comparison, is also where small
objects are copied out to their semi-space), the versions of all live
objects are updated.  Additionally, the allocated sizes of both the young
and old regions are calculated.  Unlike the copying collector, we have the
option of growing the mid-sized heap before performing the
copying/compaction, providing more up-to-date and thus accurate feedback.
If we don't copy, then we only compact the "young" region.  Thus the
advantage of aging objects is to avoid copying them (at the expense of
leaving fragments).  Alternatively, if we determine that there is too much
free space in the aged section, we can temporarily reset the water-mark
and compact that region too.  The big advantage of compaction is that age
is directly related to how far up the heap we are; objects near the top
are guaranteed to be younger than those at lower levels.  The water-mark
is therefore an accurate representation of age.  Since this heap is
efficiently growable, I'm looking to start it at 16K (or some fraction of
the semi-space), since each thread requires its own local

Re: Calling conventions -- easier way?

2001-10-19 Thread Michael Maraist

On Fri, 19 Oct 2001, Gregor N. Purdy wrote:

> Dan --
> FWIW, I'd rather not dedicate registers to special uses at the Parrot
> level. Jako reserves [INPS]0 for temporaries, but that is at its
> discretion, not dictated by Parrot (and I like it that way :). I was
> wishing for separate data and address stacks when I was writing the
> Jako calling convention stuff, but I can live with the rotate solution.
> If the choices come down to reserved registers or separate stacks,
> though, I'm squarely in the separate stacks camp.

Not that it matters at this stage, but if we rely on compilers for
temporaries, then what happens when we integrate different compiled
binaries?  MIPS (and I believe Alpha) was good at reserved registers.
Aside from "macro'd ops", most reserved registers are only reserved at
subroutine-call time.  Eventually all the compilers are going to have
to deal with register renaming (we're going to run out of registers
for a lot of code, and I don't think pushing and popping the register
stack is going to be very helpful).

I'd be curious to read up on register-renaming compiler strategies
(anybody have any online references?).  I guess the basic algorithm is
to assume an infinite number of registers, then perform data-dependency
checks and reduce the tree.  In my up-and-coming post, I suggest a
"peek R, i|ic" op-code which allows you to store fundamental
datatypes in subroutine-local stack memory (like hardware CPUs).  If
all variables are mapped within the stack frame, then register
renaming should merely consist of pushing an unused value to its
associated frame offset, reusing that register, and extracting its
replacement value from another address.

-Michael





Re: Calling conventions -- easier way?

2001-10-19 Thread Michael Maraist

On Fri, 19 Oct 2001, Dan Sugalski wrote:

> At 01:24 PM 10/19/2001 -0400, Gregor N. Purdy wrote:
> >James --
> >
> >Should we have bsr(i|ic, i|ic), that jumps to $1, with the return
> >address below the $2 arguments? Similarly, should we have ret(i|ic),
> >that rotates the return address out from under $1 arguments and then
> >returns to it?
>
> Nope. For a number of reasons, the biggest being that we're not returning
> our return values on the stack. :)
>
> > > OTOH, we could keep our current ABI, and pop the return address into an I
> > > register, and then push it and ret, or jmp to it.
> >
> >The return address is a pointer, which doesn't necessarily fit in an IV,
> >I believe. I played with doing this first, but didn't like having to use
> >a register for it. If we wanted registers to be part of the calling
> >convention, we should just say so, but again we run into the question of
> >which type of register to use.
>
> I'm currently leaning either towards returning values in registers
> [PSIN][0-4] with the total number of each type in register I0 somehow, or
> as arg stack that's separate from the call stack. The latter is probably
> the better way to go, but it's got more overhead in some ways. It'd make
> interfacing with RPN languages easier, though.
>

I was in the middle of writing up something on this topic.  I'm still
working on the details, so I'll summarize.

The more out-of-band data, the better.  It's faster (no munging an
all-in-one stack), more efficient (we don't have to abstract its type),
and we're not limited the way some hardware is (e.g. having a single stack
segment).

Most of my bias is based on a paper on stack-based machines presented
by someone earlier in this mailing list.  It was a good paper,
though somewhat dated.

The only problem with having out-of-band return addresses is syncing
up with the stack (unrolling it for exceptions or frame analysis).  I
was working on a concept that used segregated stack frames for the
parameter stack, with the stack-frame header referencing the
top of the return-address stack.  Alternatively, the
return-address stack could contain a reference into the parameter
stack(s).  The latter would be easier for unrolling, but makes the
pushed value a little more complex.

I'm not particular on the details, though I'm still working on them,
but I definitely think out-of-band return addresses are a "good
thing"(tm).

-Michael




Re: A warning on threads...

2001-10-15 Thread Michael Maraist



> I'm about to start welding in support for multiple interpreters, running
> both serially and simultaneously, into Parrot. (With provisions for
> starting and coordinating interpreters from other interpreters)
>
> This is just a heads-up, since it probably means platforms without POSIX
> thread support will need to provide some workarounds. (I'll be putting
> together a generic low-level thread interface with stub routines/#defines
> to make things easier) FWIW, I do *not* plan on supporting POSIX d4
> threads--final draft or nothin'. (Or, rather, final draft or someone else
> writes the wrappers...)
>
> Whether threads of some sort will be required for Parrot's up in the air--I
> want to wait and see what sort of performance impact there is before making
> that decision.
>
>   Dan
>

Well, the memory allocator is most definitely affected by having
threading enabled at compile time.  The default GNU (glibc) memory allocator
assumes threading, as I'm sure the Solaris one does, so traditional
builds (a la Red Hat) are not going to be affected (assuming there is a
"use-the-OS-malloc" configuration flag like in perl5).  The simple
allocators (like perl5's) just use a big lock, which adds a fixed
overhead for both threaded and non-threaded builds (though hurting
scalability).  But glibc, and the
Solaris magazine architecture (proposed in the earlier email), require
a redesign which adds a tremendous amount of overhead for single
threading (though they scale nicely).  I've been avidly reading up on
memory allocators and writing my version of a magazine allocator.  I
deviate substantially from the Sun paper, given that they suggest just
using sbrk and mmap for anything below the slab (since they're
trusting that the vmem architecture in the kernel is fast enough).
This clearly isn't acceptable on non-Solaris machines, which may be
horribly slow at mmap.  Additionally, I'm taking advantage of the
per-thread interpreter, which avoids having to do ANY locking for the
simplest case of small allocs/frees.  More complex cases, such as
larger allocs which are too sensitive to unused memory, trade off
locking contention for free space.  The largest allocs (as with most
memory schemes) are handled with simple low-level access (sbrk / mmap).

My current incarnation is very modular, which means that there are lots
of function calls in the worst case.  Once I debug it fully, I'll look
into consolidating everything into one large function call, which
should avoid some redundant locking and further speed things up.  The
bad part about Sun's paper is that their benchmarks are hogwash.
One of them shows performance when you use alloc/dealloc pairs of a
fixed size.  This is the optimal case for their allocation scheme,
since it only involves a handful of assembly instructions in highly
CPU-local cached memory.  My benchmark takes a rather
large array and randomly chooses cells within the array.  If the
cell value is null, it makes a randomly sized allocation; if it's
filled, it frees it.  This at least simulates multi-sized,
multi-lifetimed operations.  My multi-threaded simulator simply utilizes
n thread-local arrays, each of size max_array_size/n.


Assuming that we utilize one interpreter per thread, there will
be a significantly reduced need for locking (and CPU-cache sharing on
multi-CPU architectures).  However, when we "require" multithreading,
there is the temptation of handing off functionality like garbage collection
to its own thread.  MT is almost always slower (more total CPU
time) and more memory-hogging than ST.  (Though obviously less real time
can pass in the rare case of multiple CPUs.)  Additionally, I'd argue
that most apps are single-threaded in design, and wouldn't value the
performance tradeoff for threaded functionality.

Perl5 runs measurably slower when built for threading (even with only
a single thread), and I believe that's mostly due to the extra pthread
function calls.  Granted, Solaris has a nice and fast proprietary MT
library, though that doesn't do the general (platform-independent) case
any good.

I think it's fair to assume no additional logic / locking occurs
within the op-codes (instead, locks are relegated to specialized API
functions that an op-code might eventually call).  So depending on how
the GC (and memory allocation in general) handles MT compilation, we
shouldn't see too bad a slow-down.  But I'd still like the
option of a non-MT-compiled core for "hell-bent execution", which
I've regularly found to be useful (operating on multi-meg data sets
which can take half an hour or more to process).

-Michael




Re: Fetching the PC?

2001-10-12 Thread Michael Maraist

On Fri, 12 Oct 2001, Dan Sugalski wrote:

> At 01:30 PM 10/12/2001 -0400, Michael Maraist wrote:
> >I'm of the opinion that you shouldn't just be able to jump into
> >another code-segment.
>
> [Snip]
>
> >And thus be prevented from changing context; That would be relegated
> >to a subroutine invocation (especially since such subroutines can be
> >dynamically modified).
>
> Ummm... what do you think we're using to call into subroutines? Or just do
> global control-flow changes?

True, even a supposed callCV or gotoCV would have to return something valid.
My current impression of the direction was to utilize callCV in a
manner similar to "exception handling", which the last time I heard
would look something like this:

  ne I1,I2, good
  setException ...
  end
good:

It stood to reason that callCV could do the same thing.  Considering
for a moment that the direction isn't completely different, the way I
see this is as:


// usage:
//   callCV P5
//   end
//   fooOP // utilized on return
AUTO_OP callCV_p {
  PMC *cv = PMC_REG(P1);

  if ( !isCV(cv) ) {
EXCEPTION(.)
RETURN // auto-next-op, which is end (handles exception)
  }

  // excluded in gotoCV
  push( interp->return_stack, code + CUR_OP_SIZE + END_OP_SIZE );

  if ( isCVInCurContext(cv, interp) ) {
// set up local return value
RETURN(getPCFromCV(cv)); // jump
  } else {
interp->next_pc = getPCFromCV(cv);
// further setup return value
RETURN // auto-next-op, which is end, (does an inter-module jump)
  }
}

The advantage is that any context/scope-cached values in the do-loop can be
flushed on a context switch.  To minimize the overhead (of a C call)
we optimize for intra-context calls, whatever that might entail; I'm
guessing intra-module calls.

Well, right off the bat, I admit that this method is paradoxical,
since PMC->op comes from a vtable, separate from the op-codes.  Such a
vtable function handler can't just call into the op-code table, so things
get more complex anyway.  Nonetheless, this is the basic idea I had in mind.

-Michael




Re: Fetching the PC?

2001-10-12 Thread Michael Maraist

On Fri, 12 Oct 2001, Ritz Daniel wrote:

> > >i fixed that. but ther's only a jump_i, no jump_ic...
> > >
> > >"jump Ix" now jumps to the absolute address held in a register Ix.
> > >Absolute means
> > >start of the bytecode + Ix.
> >
> > It can't mean that. This:
> >
> >set I0, 0
> >jump I0
> >
> > should coredump.
> >
>
> i think parrot is a virtual cpu so the word absolute should be seen in context
> of the vm, not the real cpu we are running on.
> within the vm address 0 should be address 0 of the bytecode, not the
> real cpu. but it would be nice to have a null pointer, so what about the first
> instruction in bytecode is at vm address 1? so a
>   set I0, 1
>   jump I0
> would be a program restart and 0 is the null pointer for which we can check
> with a conditional branch...and a jump to address 0 results in a crash
> (it does so on my windoze).
>
> but the best solution in my opinion would be:
> first instruction in bytecode is a 'end' (at address 0), then the program itself
> starts at address 1. a jump to address 1 is a program restart, we can check
> for null pointer with conditional branch and a jump to address 0 would be an
> immediate program end (it doesn't look nice if an interpreter core dumps). or
> better than an 'end' is an opcode that throws some error messages, ends the
> program but does not core dump.

I started out with an annoying question, but then managed to answer it
myself.  Still I think I can clear up some confusion as to the
practicals.

What you are suggesting basically is:

unsigned int PC = interp->pc;
int* code = interp->cur_code_segment;
optable_t* optable = interp->optable;
while (code[PC]) { PC = optable[code[PC]](code, PC, interp); }

A bounded version of the op-loop would simply be:

while(PC < max_code_size &&  code[PC] ) DO_OP..

Since we're zero-bounded (due to unsigned int)

This requires an extra level of indirection in most op-codes as well
as in the main loop (not to mention the marginal overhead of an extra
parameter).

The assembler would have to do:
s/ P(\d) / code[PC + $1] /x;

instead of just
s/ P(\d) / code[ $1 ] /x;

I'm of the opinion that you shouldn't just be able to jump into
another code-segment.  That the interpreter core should be manipulated
(to change the context, such as bounds).  In either code[PC] or *code,
the jump is still:
AUTO_OP jump_i {
  return(INT_REG(P1));
}

And thus you'd be prevented from changing context; that would be relegated
to a subroutine invocation (especially since such subroutines can be
dynamically modified).  I'll have to see some real subroutine
implementations before I can support this method, though.

I'm curious to see whether gcc -O2 can alleviate the overhead of *(code + PC
+ offset) for the parameters.  For constant offsets, the x86 does a
good job (at least at the assembly level; I know the micro-ops
still require an extra add, but they may just be using a 3-way add).


In any case, I think we're all in agreement about not remapping the
physical C-addresses, but for completeness I'll give the reasoning.

Given that moduleA will have interp->code range from, say, 28M to 28.4M, and
moduleB will have interp->code range from 41.4M to 42.2M, etc. (wherever mmap
assigns the addresses), it would be almost impossible to map the PC to
a linear physical address.

Obviously the PC can't be a contiguous zero-based address, since it's not
code[PC]; it's PC = cur_code + rel_branch, DO_OP( PC ).  Physically indexed
jumps are therefore meaningless.  "jump 500", where 500 is a
compile-time constant, is really trying to say jump interp->code_base[ 500
], not PC=500, DO_OP( PC ), since that would be a core dump (that's
C memory address 500, which is off limits).  But
"jump interp->code_base[ 500 ]"
is physically no more beneficial than
"branch PC - label";

Jump-register can be useful for certain optimizations (e.g. switch).

Note that my original take was that we should either have
  while(code) DO_OP
or
  while(code && *code) DO_OP

as the fast do-loop, since:

  set I1, 0
  jump I1
would just act like exit.  Then I thought about:

  set I1, 1
  jump I1

which would most definitely core-dump.  Thus we're not really much safer
checking code's address than any other special value.

The benefits of while(code[PC]) are potentially outweighed by the
overhead of code[PC + arg_offset].  Thus I'm mostly inclined to always
support the current "safe" method ("while(code > start && code < end)").
>
> -daniel
>
> > >the following code will result in a simple program restart,
> > >no core dump:
> > > set I1, 0
> > > jump I1
> > >
> > >the fixed jump breaks the tests: basic5, call.pasm, jump.pasm
> > >but i wonder why nobody realized that jump's broken, the doc says it jumps
> > >_to_
> > >the address, not forward or backward
> >
> > That was brought up a while ago, but I don't think anyone's had time to put
> > a patch in. I'm working on stack and jsr support, so I'll fix it then.
> >
> > Dan

-Michael




Re: rand/srand patch

2001-10-12 Thread Michael Maraist



> At 08:48 PM 10/11/2001 -0400, Ryan O'Neil wrote:
> >I was playing with Parrot and wanted a basic random numbers
> >implementation.  Just in case anyone else wants it too, here are the
> >appropriate diffs and test file.  It seemed logical for rand to return a
> >real number between 0 and 1 instead of having any reliance on
> >RAND_MAX.  Any other ideas?
>
> I'm not 100% sure I want random number generation to be a fundamental part
> of the interpreter. (Though it *is* really appealing...) This sort of thing
> might be better put in the standard library rather than built in as a core
> opcode.
>
> Nifty, though. :)

My take on this follows from Larry's attempt to "Huffman encode" the
language.  What do we use most often, and how efficient should that stuff
be?  It seems that a log-string op-code could be very useful to module
developers; if they can't get system modules to load properly, then a
guaranteed log op-code would be invaluable.  Alternatively, things like
time and rand are quick little functions that are very useful in inner
loops.  Profiling and optimized loops would really love to have them.  The
only negative I see is in the compiler complexity (the CISC vs. RISC
argument).  I don't know how much time a system-library call will take
(meaning this point might be moot if it's fast enough), plus time and rand
are probably good candidates for overloading, which would require
higher-than-op-code status.

Oh well.

-Michael




Re: [PATCH] Big patch to have DO_OP as optional switch() statment

2001-10-09 Thread Michael Maraist

> void
> runloop(int *code) {
> static void* array[] = { &&op_end, &&op1, &&op2 };
>
> start:
>   goto *array[*code];
>
> op1:
>   foo; goto start;
>
> op2:
>   foo; goto start;
>
> op_end:
>   return;
> } // end code
>  w/ code=1000, loop_cnt = 5,000,000
> gcc -O3 speeds.c
> --
> switcher time delta=195
> caller time delta=268
> jumper time delta=185
> code compile time delta=0
> fast jumper time delta=154
>

I saw another post on the improvements provided by the simple
modification of:

goto *array[*code];

lbl1:
  # do stuff
  goto *array[*code];
lbl2:
  # do stuff
  goto *array[*code];
...
end:
  return;

And they were right.  I received MASSIVE performance boosts.

gcc -O3 w/ 1000 ops and 5,000,000 loops
jumper code = 31/32 seconds
fast jumper = 32/31 seconds

gcc -O3 w/ 50,000,000 ops and 10 loops
jumper code = 14 seconds
fast jumper = 13 seconds


This is a FULL 500% faster than the fastest above code.  Note also
that the "fast jumper" is only on par with the portable version.  (The
fast jumper modified the source-code to insert physical addresses
instead of using indexes to the array).

Two possible reasons for this.  One is cache-related issues, but I
highly doubt it because even the 50Meg (non-cachable) code was
insanely fast (thanks mostly to predictable pre-caching).  The
removal of a branch is the more likely reason.  Previously there were
only about two instructions between branches, so the pipeline couldn't
be utilized.  Note that more complex instructions wouldn't change this;
they'd just be more efficient.  What strikes me as odd, though, is that
I used a constant jump (goto start), which the Pentium Pro line is
supposed to handle efficiently.

The next thing to test is having 100 core op-codes.  Then it'll be
interesting to see how horrible adding the conditionals are.
Theoretically we don't need any conditionals.  "change_state" op-codes
can simply return NULL just like end does, thereby breaking out of the
loop automatically.

I never understood the reasoning behind while(*code) instead of just
while(code) (which is properly handled here).  Obviously there's value
to while( code > start && code < end ), but that's in the non-hell-bent loop.

Oh by the way.

500M ops in roughly 15 seconds works out to about 33 million parrot ops
per second.  This isn't too shabby on a mere 466MHz machine: around 14
cycles per parrot op.  It ran even faster with the non-cached memory
(200Meg sys-ram) than with short cachable bursts, but that's mostly
because the prefetcher was always correct.  Note that this was only
because there were just 3 assembly instructions per op-code, and all the
op-codes were of the same form (inc, get-next-op, jump).
Pure-mathematical real parrot-ops should get pretty close to this, but
most parrot-ops will have evil conditionals that thwart such performance.

-Michael




Re: [PATCH] Big patch to have DO_OP as optional switch() statment

2001-10-09 Thread Michael Maraist

On Tue, 9 Oct 2001, Dan Sugalski wrote:

> At 01:20 PM 10/9/2001 +0200, Paolo Molaro wrote:
> >On 10/07/01 Bryan C. Warnock wrote:
> > > while (*pc) {
> > > switch (*pc) {
> > > }
> > > }
> If anyone wants an official ruling...
>
> DO_OP can contain any code you like as long as it:
>
> *) Some version ultimately compiles everywhere
> *) Allows opcodes above a certain point (probably 256 or 512) to be
> overridden at runtime
>
> Computed gotos, switches, function table dispatches, generated machine
> code, or some other form of coding madness are all OK. I don't care which
> way a particular platform goes as long as it doesn't constrain another
> platform's ability to go another way.

Hehe.  Ok, I delved into the particulars of gcc and found:

void
runloop(int**code) {
static void* array[] = { &&op_end, &&op1, &&op2 };

start:
  goto *array[*code];

op1:
  foo; goto start;

op2:
  foo; goto start;

op_end:
  return;
} // end code

The key was &&lbl, which returns the physical address of a jumpable
label.  This is highly gcc-specific, but given that we can use
#ifdefs I don't see a problem with it. :)

I even wrote an ugly version that avoided that array indirection.
What it did was "convert" the code-base to change the op-codes into
physical addresses.  This obviously has some negative ramifications,
but is definitely doable.  By removing that extra indirection I got a
couple percent extra improvement (12%).  I can't imagine you'll get much
faster than that (except for an inlining JIT or perl6-to-C compiler).
The only problem was that the &&lbl's didn't want to be non-local, so
I had to perform some evil magic.

The unpolished code is attached with a cheesy instruction set which
was just designed to prevent the optimizer from optimizing away my
op-codes.

Here are the benchmarks on a 466MHz dual-proc Celeron:


w/ code=500,000,000 and loop_cnt = 10
(cache-bound)

gcc speeds.c
-
switcher time delta=45
caller time delta=47
jumper time delta=43
code compile time delta=4
fast jumper time delta=34

gcc -O speeds.c
-
switcher time delta=30
caller time delta=41
jumper time delta=28
code compile time delta=3
fast jumper time delta=26

gcc -O2 speeds.c
---
switcher time delta=30
caller time delta=41
jumper time delta=28
code compile time delta=3
fast jumper time delta=25

gcc -O3 speeds.c

switcher time delta=29
caller time delta=41
jumper time delta=28
code compile time delta=2
fast jumper time delta=26

==
==

 w/ code=1000, loop_cnt = 5,000,000
gcc -O3 speeds.c
--
switcher time delta=195
caller time delta=268
jumper time delta=185
code compile time delta=0
fast jumper time delta=154

# Note a marginal improvement from switch -> jump, yet an even bigger
improvement in jump to fast-jump.

It's unlikely to have a 50M (x4B+) code-base. But it's possible to
have an inner loop that has 500,000 iterations.

-Michael


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <time.h>


//#define SIZE 5000
#define SIZE 1000
int codep[SIZE];

void
dump_mappings()
{
  char strbuf[1024]; 
  int pid = getpid();
  sprintf( strbuf, "cat /proc/%i/maps", pid  );
  system( strbuf );

  sprintf( strbuf, "ls -l /proc/%i/fd", pid );
  system( strbuf );

  printf("press enter to continue");
  scanf("%c",strbuf);
 
}

int x,y,z;

void
switcher( int*code)
{
  while(1)
switch(*code) {
case 0: goto DONE;
  break;
case 1: x++;
  code += 1;
  break;
case 2: y++;
  code += 1;
  break;
case 3: z++;
  code += 1;
  break;
case 4: x = y + z;
  code += 1;
  break;
}
 DONE:
  return;
} // end switcher



int* end(int*code) { return 0; }
int* o1(int*code) { x++; return code+1; }
int* o2(int*code) { y++; return code+1; }
int* o3(int*code) { z++; return code+1; }
int* o4(int*code) { x = y + z; return code+1; }


typedef int* (*op_handler_t)(int*code);

op_handler_t op_codes[] = {
  end, o1, o2, o3, o4
};

void
caller( int*code)
{
  while(*code)
code = op_codes[ *code ](code);
} // end caller

void
gotoer( int*code)
{
  static void* array[] = { &&lend, &&lo1, &&lo2, &&lo3, &&lo4 };


 start:
  /*
  if ( !code ) {
puts("null code");
exit(-1);
  }
  if ( *code > 3 ) {
puts("Invalid idx");
exit(-1);
  }
  */
  goto *array[ *code ];

  lo1:
x++;
code += 1;
goto start;
  lo2:
y++;
code += 1;
goto start;
  lo3:
z++;
code += 1;
goto start;
  lo4:
    x = y + z;
code += 1;
goto start;

 lend:
return;
} // end gotoer

int* 
convert( int size, int*code, int*map )
{
  int* retval, *base;
  int idx =0;
  //dump_mappings();
  retval = base = (int*)malloc(size * sizeof(int));
  if ( !retval ) {
puts("out of memory");
exit(-1);
  }
  //dump_mappings();
  
  //printf("retval=%p, codep=%p\n", retval, codep );
  /*
  if (!retval) {
puts("couldn't allo

Re: [PATCH] content preserving register pushes

2001-10-08 Thread Michael Maraist


> Currently, instead of pushing the contents of fixed registers onto the
> stack, the registers themselves float on top of the stack.  That doesn't
> preserve register contents on a push_x, as the registers are now in a new
> memory location.  After toying with fixed registers per interpreter, I
> decided that it probably wasn't worth it in the long haul.  (Since it's
> doubtful, given the size of our register sets, that this would need to be
> done often.)  Instead, I'm proposing a parallel set of data-preserving
> push_x ops, save_x.  Patch after .sig.
>

The biggest problem that you have with save_x is how to retrieve values when
you do a pop_x.  Most any nested block / subroutine call needs to return at
least one value.  One possibility is via a consolidate op-code which is just
like pop but accepts a single register to preserve.

AUTO_OP preserve_i {
  int tmp = INT_REG(P1);
  Parrot_pop_i(interp);
  INT_REG(P1) = tmp;
}

But that only works when you have exactly one preserved reg, which is not
always the case, especially with nested blocks of code that might want to
modify several of their enclosing scope's variables.

My point is that by itself save_i is useless except for possible
side-effects (such as outputting).

I've suggested before a SPARC-like register window.  The relevant part here
is that you can do a partial push / pop of half the register set.
This can be thought of as dividing the register set into input / output
registers.  A given scope uses all registers.  But when it comes time to
enter a new scope, it has to make sure that all output parameters are in
the upper half, then perform a half-push.  When leaving a scope, it has
to make sure that all returnable parameters are in the lower half, then
perform a half-pop.  One advantage is that this is faster than a save_i
instruction, since only half the regs are saved.  Additionally, we're not
duplicating ALL the registers, just those that we're not using in the nested
scope.  That saves (num_regs/2)*sizeof(reg) memory on each new scope
(including function calls).

The only complex change this would require in the existing code-base is that
register sets couldn't just be linked-lists.  They'd have to exist as arenas
of multiple contiguous reg-sets.  The current reg set would consist of an
arena pointer and an offset into that arena (cached by the reg-set pointer).
If you're at the end of a memory arena, for example, a half-push would
require allocating a new arena with num_regs*num_frames_in_arena then
copying num_regs/2 registers from the top of the old arena to the bottom of
the new arena since a register set has to be contiguous.  Full pushes
shouldn't be affected; if there is only a half-frame left in the current
arena, then it can just skip it and allocate a new arena.

Half-pushes for subroutines are great since you can optimize small
subroutines to work within the limited free "out-regs".  So the following
code:
sub foo($x : int, $y : int ) : int # equiv to C "int foo(int,int)"
{
  return  $x + $y
}
$::z = foo($::x, $::y);

can be:
foo:
  add I16, I16, I17
  return

main:
 ld I16, [x]
 ld I17, [y]
 call foo
 st [z], I16

Additionally:
sub bar($x : int ) : int
{
  return foo( $x, 5 );
}
$::z = bar($::x)

becomes:
bar:
  half_push_i # I16 is now I1, I17 is now I2
  mov I16, I1 # I1 -> I16
  set I17, 5
  call foo
  mov I1, I16 # save output of foo to our output reg
  return
main:
  ld I16, [x]
  call bar
  st [z], I16

Note that in this case we could have additionally optimized bar to:

bar:
  set I17, 5 # append to parameter stack
  call foo
  return

Note this only works when the parameter numbers are known.  We'd still have
to resort to something else for dynamic parameters.  I'm partial to an
external general-purpose stack.  For anonymous invocation

%dispatcher = ( foo => \&foo );

$val = <>;
@params = <>;
$dispatcher{ $val }->( @params );

You'd have to have a generic wrapper function that performed bounds-checking
for the prototype.

foo_generic:
  GetStackFrameSize I16 # SP-FP -> I16
  ne I16, 2, foo_bad_num # invalid num of parameters
  POPStack I16 # throws exception if not type int
  POPStack I17


  call foo # hopefully storing on separate return stack since it's not
   # useful as a register??? Else we need to save reg-stack
  PushStack I16
  return

foo_bad_num:
  exception "invalid num params"
  end
  return

main:
  push_p
  push_s
  PushStackFrame

  # %dispatcher = ( foo => \&foo );
  newHash P1
  ld P2, ["foo_generic"]  # CV
  SetHash P1, "foo", P2

  # $val =<>;
  readln S1, 0 # read stdin

  # @params = <>;
  newArray P3
  readArray P3, 0 # gulp rest into @P3 (@params)

  HashFetch  P4, P1, S1  # P4 is presumably CV

  PushStackFlatten P3 # takes each el from P3 and puts on stack
  call foo_generic # assumes X16..X32 are volatile
  PopStackFrame

##

I had originally wanted a complete emulation of the SPARC reg-stack, since
it was very elegant.  You'd have 2 separate register stacks.  You'd what
I've des

Re: vtable.h

2001-10-06 Thread Michael Maraist

On Sat, 6 Oct 2001, Michael Maraist wrote:

> My question at this point is whether the PMCs are polymorphic like
> Perl5's, or if there is an explicit type tag.  Polymorphics can make
> for very large vtable sub-arrays. (int, int_float, int_float_string,
> int_string, etc).
>
> If PMC-types are bit-masked (for easy upgrading) such as:
> O   O O   O
> ^   ^ ^   
> |   | |   ...
> INT FLOAT STR
>
> We could apply a macro that extracts the desired type, such as
> GET_PMC_TYPE_INT(Px), which returns int if the value is of type int,
> else float, else string.
>
> #define GET_PMC_TYPE_INT(type) \
>   ((type) & PMC_INT) ? PMC_INT : \
>   ((type) & PMC_FLOAT) ? PMC_FLOAT : \
>   ((type) & PMC_STRING) ? PMC_STRING : (type)
>
> Likewise GET_PMC_TYPE_FLOAT would return first float then int then string
>
> It's not as fast because we're not avoiding the nested if-statements,
> but it's easy enough to read.
>
> P2->vtable->add[ GET_PMC_TYPE_INT(P3->type) ](...)

Oops!  Stupid me.  I forgot that at the add_p_p_p level we don't know
that it's an INT / FLOAT, etc.

The only way that we could use bit-masked types is if we wrote a
complex if statement:
 ( P2 & PMC_INT )? (P3 & PMC_INT ? PMC_INT : (P3 & PMC_FLOAT ?
 PMC_FLOAT : PMC_STR ) :
   (P2 & PMC_FLOAT)? (P3 & PMC_FLOAT ? PMC_FLOAT : ( P3 & PMC_INT ) ?
   PMC_INT : PMC_STR ) :
   (P2 & PMC_STR)? ( P3 & PMC_STR ? PMC_STR : 

This is beyond ugly, not to mention not upgradable if/when we add new types.

So unless we're using an enum-style type compaction, doubly indirected
vtables won't be feasible.

>
> Ideally, the bit-pattern for the pmc-type is numerically small (for
> small sub-arrays).
>
> enum PMC_TYPES { PMC_INT, PMC_FLOAT, PMC_STR, PMC_INT_FLOAT,
>  PMC_INT_STR, PMC_INT_FLOAT_STR,
>  PMC_FLOAT_STR, ... };
>
> In this way we simply map everything that has INT in it to the same
> handler.  No conditionals at all (but lots and lots of vtable space).
> Thankfully this is constant and could be assigned globally such that
> there is no initialization overhead.
>
> -Michael
>

-Michael




Re: vtable.h

2001-10-06 Thread Michael Maraist

On Sat, 6 Oct 2001, Simon Cozens wrote:

> On Sat, Oct 06, 2001 at 09:01:34AM -0500, Gibbs Tanton - tgibbs wrote:
> > Also, how will adds of different types be handled.  In the above if pmc2 is
> > an int and pmc3 is a float we're going to have to know that and do a switch
> > or something to convert to/create the right type.
>
> There'll actually (and I need to change my vtable code to reflect this) be
> several versions of each vtable function, depending on the relative type of
> each PMC. Basically, there'll be two easily optimizable versions (i.e. types
> are the same, or types can be easily converted with a cast or simple function)
> and a non-optimized version, which would actually be the naive implementation
> in many cases. ("These types are way out of my depth - call ->get_integer on
> each one, and add the result.")

So would it be something like (ultimately put into a macro):
AUTO_OP add_p_p_p {
  if (!P1)
    CREATE_PMC(P1);
  if (!P2 || !P3)
    throw exception; // however this is done in Parrot
  P2->vtable->add[P3->type](interp, P1, P2, P3); // in macro
}

In this way each vtable operation is really an array of handlers, one for
each possible type of input.  This avoids any comparisons.  Invalid
combinations all share a common function (which throws an "invalid
data-intermingling" exception).

int_pmc_vtable = { ..., { pmc_vtable_add_int_int,
pmc_vtable_add_int_float, pmc_vtable_add_int_string,
pmc_vtable_add_int_iconst, pmc_vtable_add_int_fconst, ... }, ... };


// maps to "RET_DT pmc_vtable_add_int_int(interp_t*, PMC*, PMC*, PMC*)
AUTO_VOP add_int_int
{
  UPGRADE(P1,PMC_INT);
  P1->ival = P2->ival + P3->ival;
}

AUTO_VOP add_int_float
{
  UPGRADE(P1,PMC_INT);
  P1->ival = P2->ival + P3->fval;
}

AUTO_VOP add_int_iconst
{
  UPGRADE(P1,PMC_INT);
  P1->ival = P2->ival + P3;
}

AUTO_VOP add_int_fconst
{
  UPGRADE(P1,PMC_INT);
  P1->ival = P2->ival + Parrot_float_constants[P3];
}

AUTO_VOP add_int_string
{
  UPGRADE(P3,PMC_INT);
  UPGRADE(P1,PMC_INT);
  P1->ival = P2->ival + P3->ival;
}

Alternatively, if we can't be both a string AND an int in PMC:
AUTO_VOP add_int_string
{
  int p3ival = PMC_STR_TO_INT(P3);
  UPGRADE(P1,PMC_INT);
  P1->ival = P2->ival + p3ival;
}



This assumes that a = b op c will be the same as a = b.op( c ), which
I think is fair.  Thus add_float_int produces a float while
add_int_float produces an int.  The compiler can worry about the order
of the parameters.

I don't think there's much value in writing separate a op= b since you
could just do:

P1->vtable->add[P2->type](interp, P1, P1, P2);

With hardly any additional overhead.  The "optimized" code might have
been:

AUTO_VOP inc_int_int
{
  P1->ival += P2->ival; // avoids casting P1
}

But now you have LOTS more vtable ops.

My question at this point is whether the PMCs are polymorphic like
Perl5's, or if there is an explicit type tag.  Polymorphics can make
for very large vtable sub-arrays. (int, int_float, int_float_string,
int_string, etc).

If PMC-types are bit-masked (for easy upgrading) such as:
O    O     O    O
^    ^     ^
|    |     |    ...
INT  FLOAT STR

We could apply a macro that extracts the desired type, such as
GET_PMC_TYPE_INT(Px), which returns int if the value is of type int,
else float, else string.

#define GET_PMC_TYPE_INT(type) \
  ((type) & PMC_INT) ? PMC_INT : \
  ((type) & PMC_FLOAT) ? PMC_FLOAT : \
  ((type) & PMC_STRING) ? PMC_STRING : (type)

Likewise GET_PMC_TYPE_FLOAT would return first float then int then string

It's not as fast because we're not avoiding the nested if-statements,
but it's easy enough to read.

P2->vtable->add[ GET_PMC_TYPE_INT(P3->type) ](...)

Ideally, the bit-pattern for the pmc-type is numerically small (for
small sub-arrays).

enum PMC_TYPES { PMC_INT, PMC_FLOAT, PMC_STR, PMC_INT_FLOAT,
 PMC_INT_STR, PMC_INT_FLOAT_STR,
 PMC_FLOAT_STR, ... };

In this way we simply map everything that has INT in it to the same
handler.  No conditionals at all (but lots and lots of vtable space).
Thankfully this is constant and could be assigned globally such that
there is no initialization overhead.

-Michael




Re: Okay, hardware gurus....

2001-10-03 Thread Michael Maraist


> > Linux/Athlon/gcc.
> >
> > Why does changing this:  (DO_OP loop partially inlined)
> >
> > while (pc >= code_start && pc < code_end && *pc) {
> > do {
> > x = z->opcode_funcs; \
> > y = x[*w]; \
> >w = (y)(w,z); \
> > } while (0);
> > }
> >
> > to
> >
> > x = z->opcode_funcs;
> >
> > while (pc >= code_start && pc < code_end && *pc) {
> > do {
> > y = x[*w]; \
> >w = (y)(w,z); \
> > } while (0);
> > }
> >
> > slow it down by 6%?
>
> Perhaps x is no longer considered a register.  Are you compiling with -O2?
> Compile to assembly (-S on gcc) and look for that inner loop.  See what is
> physically different.  (I'm away from my test-environment at the moment)
>

Oh, better yet, given all the cache-work I've done in the past two days,
it's entirely possible that a cache line is violated due to the new
assembly ordering.  You'd have to check the assembly source to know.

One check you can perform is to put a temporary variable within the loop
and assign the outer variable to it.  This avoids the indirection but keeps
the cache / register set populated.

If you did you -O2 it shouldn't matter though, because the code would be
reordered either way.

-Michael




Re: Okay, hardware gurus....

2001-10-03 Thread Michael Maraist

> Linux/Athlon/gcc.
>
> Why does changing this:  (DO_OP loop partially inlined)
>
> while (pc >= code_start && pc < code_end && *pc) {
> do {
> x = z->opcode_funcs; \
> y = x[*w]; \
>w = (y)(w,z); \
> } while (0);
> }
>
> to
>
> x = z->opcode_funcs;
>
> while (pc >= code_start && pc < code_end && *pc) {
> do {
> y = x[*w]; \
>w = (y)(w,z); \
> } while (0);
> }
>
> slow it down by 6%?

Perhaps x is no longer considered a register.  Are you compiling with -O2?
Compile to assembly (-S on gcc) and look for that inner loop.  See what is
physically different.  (I'm away from my test-environment at the moment)

-Michael




RE: thread vs signal

2001-10-01 Thread Michael Maraist

On Sun, 30 Sep 2001, Hong Zhang wrote:

> > >How does python handle MT?
> >
> > Honestly? Really, really badly, at least from a performance point of view.
> > There's a single global lock and anything that might affect shared state
> > anywhere grabs it.
>
> One way to reduce sync overhead is to make more operations atomic instead
> of sync. For example, read() write() are atomic. There is no need to
> sync stream. The array get/put are atomic in Java, so we don't need sync
> either. The high level library or app itself will be responsible for the
> sync.

Now how do you go about performing an atomic operation in MT?  I
understand the desire for reentrance via the exclusive use of local
variables, but I'm not quite sure how you can enforce this when many
operations are on shared data (manipulating elements of the
interpreter / global variables).

I definitely agree that locking should be at a high level (let them core
dump if they don't design it well).  I liked the perl5 idea that any
scalar / array / hash could be a mutex.  It prevents you from having to
carry around lots of extra mutex values.  We can achieve the exact
same "synchronization" policy as Java, or one that's finer-tuned for
performance.


-Michael




Re: thread vs signal

2001-09-30 Thread Michael Maraist

> Dan Sugalski wrote:
>
> > >How does python handle MT?
> >
> > Honestly? Really, really badly, at least from a performance point of view.
> > There's a single global lock and anything that might affect shared state
> > anywhere grabs it.
>
> i.e. not so much 'threaded' as 'stitched up'.

Well, python never advertised speed.  Threading is just a clean way of
handling certain types of problems.  Sounds to me like they simply valued
simplicity and stability above other things.  So the word "bad" is
obviously subjective.  I've definitely heard people praise their extension
code.  Recall, in fact, that in the review of the great perl6 conference,
several people walked out saying that they were just going to get into
Python.

Speed can't be the top factor here.  If you make extension code easy
enough to write, then people can profile and write "fast" code when
necessary.  And more to the current point, MT programming isn't
necessarily faster code (especially if you don't have multiple CPUs).  The
classic example of MT code, where you spawn off a new thread to do some
CPU-intensive work in the background, is less common than simple IO threads
(which include server connection points and loading/saving files).  More
often, I see "forked" code which handles slow tasks.  It's reliable,
and has little overhead so long as the amount of work is great.

-Michael





RE: SV: Parrot multithreading?

2001-09-29 Thread Michael Maraist

> > or have entered a mutex,
>
> If they're holding a mutex over a function call without a
> _really_ good reason, it's their own fault.

General perl6 code is not going to be able to prevent someone from
calling code that in turn calls XS code.  Heck, most of what you do in
perl involves some sort of function call (such as stringifying).

Whatever solution is found will probably have to deal with exceptions
/ events raised while a mutex is held.


That said, there's no reason why we can't have _all_ signal handler
code be:

void sig_handler(int sig) {
    interp->signal = sig;
}

This would just require some special handling within XS I suspect.

-Michael




Re: thread vs signal

2001-09-29 Thread Michael Maraist

>
> I generally divide signals into two groups:
>
>   *) Messages from outside (i.e. SIGHUP)
>   *) Indicators of Horrific Failure (i.e. SIGBUS)
>
> Generally speaking, parrot should probably just up and die for the first
> type, and turn the second into events.

I don't know.  SIGHUP is useful to capture.  Not to mention, how else
would you maintain Perl5 compatibility with $SIG{HUP}?  If you're a
daemon, SIGHUP is good to restart off of.

I'm seriously doubting that you're going to get a fully portable
threading model.  One size doesn't fit all.  Usually that's something
that the JIT would take care of (hence my preoccupation with
fake-threads).  How does python handle MT?

-Michael




Re: [PATCH] print_s_v op (was: RE: variable number of arguments)

2001-09-25 Thread Michael Maraist

> All --
>
> > I've created a varargs-ish example by making a new op, print_s_v.
> > This is pretty rough, and I haven't updated the assembler, but it
> > seems to work.
>
> Um.. I *have* updated the assembler. It's the *dis*assembler I haven't
> updated. This is what happens:
>
>   * *_v ops list their number of args as 1 + the number of fixed
> args. They list the types of the fixed args, plus a last arg of 'v'.
>
>   * The assembler sees all this and splices in an arg count for the
> 'v' arg place, and then hangs however many args remained on the
> line after the count.
>
>   * The op decides how many args it is expecting. For print_s_v,
> that is via the format string (but it really should take care to
> not read more than the "arg count" arg says are present).
>
>   * Since the op is in control of returning the next address to
> execute, it is trivial for the op to arrange for execution to
> continue after the last vararg (thanks, Dan!)
>
>   * Since the assembler splices in the arg count, though, the
> disassembler won't have to think very hard to find the beginning
> of the next op when doing its job.
>
>   * Of course, all this assumes that all args have sizeof() ==
> sizeof(IV), which won't be true until NV constants are moved to
> the constants area. SO DON'T USE THEM. :)
>
> Anyway, whether or not folks really like the idea of having ops with
> variable numbers of args, this proves it's not too hard to do with a
> small amount of cooperation between the assembler and the op func.

With var-args, we could produce highly efficient SIMD instructions.
printf obviously, multi-push/pop, createArrays/ Hashes.

I toyed with the idea of a stack-based ALU to handle algebra such as:

( ( a + b*c ) / d + e ) / ( f + g + h )

// compute dest, "fmt", register-names
compute Rd, "++^*+/+", f, g, h, a, b, c, d, e

Unfortunately I couldn't find a way of guaranteeing that it's efficient.
The best I've come up with so far is to use a mini-switch statement.  The
only reason it might be faster than an optimized switch-statement over all
op-codes is that the switch has fewer terms.

The above case also gets complex if we don't use multiple registers for
intermediate calculations.

Nonetheless, the possibilities are cool.

-Michael




RE: Strings db

2001-09-25 Thread Michael Maraist

> and a call to the API would be:
> char *label = gettext( "This feels strange\n" );

Does your idea allow for:
int msgid = txtToMsgid( "This feels strange\n" );
char *label = msgidToRes( msgid );

in addition to the above?  That form affords compile-time optimizations.

I'm not following this thread with enthusiasm (I'm an English bigot
myself), but I'm assuming it involves (among other things) displaying
locale-based error messages.  Such messages could be known after
compilation.  Granted, errors don't really need to be fail-fast.

-Michael




RE: variable number of arguments

2001-09-24 Thread Michael Maraist

> >
> >I have a minor issue with a proliferation of createArray.  In perl5 we
> >used the Stack for just about everything minus physically setting @x =
> >(1,2,3).  The creation of a dynamic array is a memory hog.
>
> Less of a hog in many ways than using a stack. Worth the times when it's not.

I don't see how.  A stack is "supposed" to be pre-existing and maintain
headroom for growth.  The simplest stack (which I think is plenty peppy)
is one that checks bounds and reallocs when it grows too much.  If it
grew that big once, then it would surely grow that big in the future,
which alleviates memory fragmentation.

There's already a stack structure in interpreter_t.  I'm not sure how
parameter passing is being implemented yet, but anything other than some
type of stack is going to have problems. (which can only be addressed via
dynamic allocation of arrays.. which are just slower-parameter-stacks).

>
> >ESPECIALLY with
> >mark-and-sweep dynamic memory (which is sounds like perl6 is aiming for).
>
> I don't see the connection here.

The connection had to do with the allocation / deallocation of these
intermediate dynamic arrays.  It was just a call of attention to the ever
present point of performance degradation.

>
> If the compiler generates code like that and I'm happy about it, someone
> needs to smack me. *hard*. :)
>
> That ought to be either a series of print ops or, more likely, a set of
> push ops and a final prints.

But pushing onto what?  A PMC which points to a dynamic array?  Pushing is
arguably slower than pre-extending an array and statically assigning to an
index, since you avoid multiple increments.  Not to mention that if you push
a million entries, you might have to regrow the array 20 or more times.

I stated why I used the array for the print, but you could substitute
print with "@y = map { .. } (1,2,@x)", which would require an intermediate
array of memory.  We just didn't have assembly available for that code.

You could set P1 to the new array and concat the three terms, but that
ultimately involves iterating through the array @x and "pushing values".
So you've not escaped the use of a stack.  What's more since this is a
temporary stack, you deal with memory management.

Once again, the alternative is a general-purpose stack, whose memory
management is minimal.

> >In any case, what we've done here is dynamically create an array each time
> >through the loop.
>
> Yes, but we're doing this with a stack-based system anyway. It's just an
> anonymous pesudo-array (i.e. the stack top), but an array nonetheless.

I'm not convinced that we can't / shouldn't implement persistent
general-purpose stacks, which include parameter passing.  Obviously this
would require frame support (since otherwise @_ would have to be a newly
allocated array), and it seems that the direction is towards
the exclusive use of registers.  I just don't know that this doesn't
provide even greater performance / memory penalties for non-static
parameter functions (such as print / map / grep or non-prototyped
subroutines).

The way I see such a stack being used:

interp:
  ..
  Stack stack;
  Stack frame;

AUTO_OP push_x {
  if ( stack->top >= stack->size ) {
grow_stack(stack);
  }
  setPMC( stack->data[ stack->top++ ], P1 );
}

AUTO_OP get_stack_frame_size {
  REG_INT(P1) = stack->top - frame->data[ frame->top - 1 ];
}

AUTO_OP push_frame {
  // auto-extend frame-stack
  frame->data[ frame->top++ ] = stack->top;
}

AUTO_OP get_stack_X_i {
  REG_X(P1) = stack->data[ frame->data[ frame->top - 1 ] + P2 ];
}
// set_stack_X_i
// alloc_stackn

AUTO_OP pop_frame {
  stack->top = frame->data[ --frame->top ];
}

Internal c-code could even get pointers to the base-frame so as to avoid
all the indirections.  So print, map, grep, etc would be able to read
quickly.  You could additionally have efficient op-codes that copy in PMC
arrays to the stack.


That said, if we're exclusively using registers for parameter passing, then
there's currently the issue of efficiently passing and returning values.
Currently the only way to pass parameters is to duplicate a given register
set, then move parameters around so they'll be where a subroutine expects
them.  The called function would then push any additional
register-segment stacks that weren't given parameters so it wouldn't
overwrite the caller's values.  When complete, it would restore those
reg-stacks that it pushed and would leave return values in P1,
I1, etc.  But now the caller has to pop its register-stacks to get back
to its old values.  But in doing so, it'll lose the return value(s).

There are two ways I can see this working.  The first involves doing a
half-push / half-pop, where you have a 16-register input window and a
16-register output window.  If a function can accomplish its tasks with
whatever is left over, then it doesn't need to do another half-push (it
just can't touch P1..16).  On return, the return values are in P17..P32.
The caller would have to copy the

Re: parrot malloc (was Re: Parrot coredumps on Solaris 8)

2001-09-24 Thread Michael Maraist

On Mon, 24 Sep 2001, Uri Guttman wrote:

> > "AB" == Alan Burlison <[EMAIL PROTECTED]> writes:
>
>   AB> If you are interested, I'll try to get a copy of the paper to you.
>   AB> It has some interesting ideas.  By having 'arenas' for each object
>   AB> type, you can remove a large part of the overhead of
>   AB> splitting & recombining memory chunks.  The perl5 allocator already
>   AB> does something similar.
>
> sounds like a wheel worth looking at. are there working free C libs that
> implement it? how much work is it to write a new one that is
> parrot-centric? would we need all of it features that manage other
> things? i assume that the segmented stacks would also use arenas like
> this. we will need to define our set of arenas and sizes at some point.
>
> we haven't touched threads and malloc. does each thread want its own
> arena so we don't need locking?
>
Interesting idea.  The main allocator could be locked, so long as x% of
all allocations come from thread-specific arenas.  The only problem is
that this adds overhead to creating / deleting threads (setting up /
freeing arena chunks).  This would only be efficient if the main allocator
only returned chunks on the order of a page size (which is then divided up
into smaller-granuled arenas by the parallel processing threads).

Another method would be to act like databases do - You finely tune your
memory allocator so that you only perform regional locks.  You also point
threads to different starting / insertion points along the memory chains.
This way you have a unified memory allocation scheme with only occasional
contention locks.  If all your data-structures are in large chunks (on the
order of a page), then this would do well. (Though it might require
architecture specific test-and-set operations for efficiency).

-Michael




RE: [PATCH] assemble.pl registers go from 0-31

2001-09-24 Thread Michael Maraist

> Just curious, do we need a dedicated zero register and sink register?
> The zero register always reads zero, and can not be written. The sink
> register can not be read, and write to it can be ignored.

I brain-stormed this idea a while ago, and here's what I came up with.
We're not RISC, so we don't need to worry about mapping "mov %r1 -> %r2"
to "or %r1, %r0 -> %r2", which is probably the most common use.
Additionally we'll always be able to use constants of size 32/64 bits.
RISC could only use 8-22 bit constants, so often it was easiest to
operate on a full 32-bit zero.  We could always add a second constant
parameter to an instruction if it was deemed useful enough.

As for a bit-bucket, I can't imagine why you'd use it.  Side effects and
no-ops are all I can remember them ever being useful for.  But again, we're
not RISC; we have a no-op, and can add separate functions for
side-effects.

Lastly, the main reason I say no:

AUTO_OP add_i_i_i {
  int p2_val = 0, p3_val = 0;
  if ( P2 != 0 ) {
p2_val = REG_INT(P2);
  }
  if ( P3 != 0 ) {
p3_val = REG_INT(P3);
  }
  if ( P1 == 0 ) {
p2_val+p3_val;  /* sink register: result computed then discarded */
  } else {
REG_INT(P1) = p2_val+p3_val;
  }
}

Any questions?

You could have an optimizer convert all %i0 to constants and rearrange the
op-code so it maps to a real one, but you could just have had the compiler
produce proper code in the first place.  I don't think we're looking to
use assembly as one of the "supported" language layers, due to its
propensity to change.

-Michael




RE: variable number of arguments

2001-09-24 Thread Michael Maraist


> > is it possible the ops to handle variable number of arguments, what I have
> > in mind :
> >
> > print I1,",",N2,"\n"
>
> This should be done by create array opcode plus print array opcode.
>
> [1, 2, 3, 4, 5]

I have a minor issue with a proliferation of createArray.  In perl5 we
used the stack for just about everything minus physically setting @x =
(1,2,3).  The creation of a dynamic array is a memory hog, ESPECIALLY with
mark-and-sweep dynamic memory (which it sounds like perl6 is aiming for).

Let's say you wanted
for(...) {
  x++
  y++
  z++
  print "x=$x, y=$y, z=$z"
}
Then you'd possibly have

LOOP: cond_eq Ix, DONE
inc I1, 1
inc I2, 1
inc I3, 1
createArray P1, 6
setIndexOf P1, 0, "x="
setIndexOf P1, 1, I1
setIndexOf P1, 2, ", y="
setIndexOf P1, 3, I2
setIndexOf P1, 4, ", z="
setIndexOf P1, 5, I3
print P1
branch LOOP
DONE:

Obviously this could be written differently (say using 6 separate
print statements, but we're saying that print is slow, so we'd want to
minimize calls to it; possibly via performing an implicit join when
printing an array).  Additionally createArray could have happened on one
line, but that's not an issue either (this avoids type issues).

In any case, what we've done here is dynamically create an array each time
through the loop.  Theoretically we could reuse the same array, but what
if instead of a loop this is part of a log function which happens to get
called a zillion times in debug mode?  For many types of gc-ers,
pointing to something new leaves the unreferenced data stranded until
we're either memory-starved or a timer goes off.  In the case of a
createArray in an inner loop, we could segment the entire memory map into
chunks that fit this dynamic array before we starve for memory.  Depending
on the memory allocator, gc-ing that data will mean coalescing ALL of it
(potentially hundreds of megs worth).

Recently, there was an international programming contest for compressing
XML, and I wrote a program using perl.  Worked great, except when I used
input data on the order of a meg, I consumed 300-400 meg worth of
user memory.  No biggie, but when the program was finished (physically
output the result), it took just as long to deallocate everything
(20 minutes).  My guess is that the gnu memory allocator used by perl5 w/in
red-hat 7.1 did some heavy coalescing.  When I used perl5's shipped
memory allocator w/ no coalescing, the program exited in a nominal amount
of time (while saving some 100 meg of memory to boot).  Let's just say that
I developed a new fondness for the perl5 memory allocator.

The lesson I learned was to not take dynamically allocated memory for
granted.  A simple mark-and-sweep GC is going to have a major hiccup
periodically with this sort of inner-loop allocation.  In this case
ref-counting would at least not leave a straggling allocation, and thus
would not micro-fragment / hiccup.  I'm sure you can write the
mem-allocator to gc before fragmenting larger chunks, but it's still
excess CPU overhead for this task.

The point here is to avoid the use of dynamic allocation where possible;
especially with the internals.

This is a perfect use for the perl-stack, short and simple.

-Michael





RE: Strings db

2001-09-24 Thread Michael Maraist

> GNU does offer the gettext tools library for just such a purpose. I don't
> know how it will translate to the various platforms however, and it likely
> is a major overkill for what we are trying to do.
> http://www.gnu.org/manual/gettext/html_mono/gettext.html#SEC2 - Purpose
> It might make sense to implement something like this now, rather than create
> our own system and find out it is insufficient down the road.
> Just a thought,
> Grant M.

But wouldn't that make parrot GPL'd?

-Michael




Re: SV: Parrot multithreading?

2001-09-24 Thread Michael Maraist

> >then what about a win/win? we could make the event checking style a
> >compile time option.
>
> Odds are you'll get per-op event checking if you enable debugging, since
> the debugging oploop will really be a generic "check event every op" loop
> that happens to have the "pending debugging event" bit permanently set.
> Dunno whether we want to force this at compile time or consider some way to
> set it at runtime. I'd really like to be able to switch oploops
> dynamically, but I can't think of a good way to do that efficiently.
>

long-jump!!!

runops(bla bla){
  setjmp(..);
  switch(flags) {
fast_runops(bla bla);
debug_runops(bla bla);
trace_runops(bla bla);
conservative_runops(bla bla);
thread_safe_runops(bla bla);
  }
}

AUTO_OP sys_opcode_change_runops {
  bla bla
  set run-flags..
  longjmp(..)
}

In C++ I'd say throw the appropriate exception, but this is close enough.

This would work well for fake-threads too, since each thread might have a
different desired main loop.  You'd have to do something like this if you
transitioned between non-threaded and threaded anyway.

-Michael




Re: SV: Parrot multithreading?

2001-09-24 Thread Michael Maraist


> Odds are you'll get per-op event checking if you enable debugging, since
> the debugging oploop will really be a generic "check event every op" loop
> that happens to have the "pending debugging event" bit permanently set.
> Dunno whether we want to force this at compile time or consider some way to
> set it at runtime. I'd really like to be able to switch oploops
> dynamically, but I can't think of a good way to do that efficiently.

If you're looking to dynamically "insert status checks every op", then
that sounds like picking a different runops function.  We've already got a
trace variant.  We could farm out a couple of these and have execution
flags specify which one to use.  If you wanted every 5th op to check
flags, you could trivially do:

while(code) {
  DO_OP(..)

  if(code) DO_OP(..)

  if(code) DO_OP(..)

  if(code) DO_OP(..)

  if(code) DO_OP(..)

  CHECK_EVENTS(interp)
}

The inner loop is a little bigger, but aside from cache issues, has no
performance overhead.  This would prevent having to interleave check-ops
everywhere (more importantly, it would reduce the complexity of the
compiler, which would have to guarantee the injection of check-events
inside all code paths, especially for complex flow-control like "last FOO").
You could use asynchronous timers to set various flags in the check-events
section (such as gc every so often).  Of course this requires using a more
sophisticated "alarm/sleep" control system than the simple wrapper around
alarm/sleep and $SIG{X}, etc.

Other methods might be: whenever a dynamic variable reference is
reassigned / derefed, an event flag is set to queue the gc, etc.

-Michael




Re: Curious about Parrot + Closures

2001-09-24 Thread Michael Maraist

> I'm just curious, is there a plan for how closures will work in Parrot?  I
> think that closures are one of the coolest Perl 5 features around (despite
> their memory leak issues :-), and I'd hate to see them go away.

I doubt that there's any limitation.  In Java, all they had to do was
supply a reference to the enclosing object and compile a new class which
used some local and some "friend" member vars.  We're obviously not
fundamentally OO here, so this approach won't work.  We still haven't seen
how PMCs (or whatever they're called) work.  They contain a structure.
That's probably perfect for the reproduction of perl5 CVs, which I believe
contained (among other things) the "locals", which included references to
closure values.

In short.  Fear not.
-Michael




Re: Parrot coredumps on Solaris 8

2001-09-24 Thread Michael Maraist

>   DS> At 12:29 AM 9/21/2001 +0200, Bart Lateur wrote:
>
>   >> >Horribly wasteful of memory, definitely, and the final allocation system
>   >> >will do things better, but this is OK to start.
>   >>
>   >> So to stop it waste memory, subtract 1 first and add it again later.
>
>   DS> Nah, it'll still waste memory. It can't not, because of the way it
>   DS> works.  It has to ask for a big gob of memory and ignore bits to
>   DS> make sure it gets the alignment right. This particular
>   DS> implementation's not all that efficient I'm sure, but it'll get
>   DS> ripped out and replaced when we have a real memory management
>   DS> system.
>
> what about a binary type allocation system?
>
> we have multiple allocation structures in an array with each one
> allocating the next size up binary size. so the lowest level does say 32
> bytes (or whatever our smallest granularity is). the highest level could
> be in the multi-megabytes or larger. when a size is requested it go to
> that size queue and if there is a free block on its list, it gets it. if
> no block exists, it requests from a larger size and will break it
> down. now, if you allocate with malloc and it is not on the proper
> boundary, the preceding ram is broken into smaller parts and put on those
> queues. so no (or little) ram is wasted. all blocks are allocated on
> proper boundaries and you can do binary coalescing during (incremental)
> GC.
>
> a first request of a small block may allocate 32 (or some other 2**N) of
> them in a large chunk. that large blocks are managed by just slice off
> the leading chunk and moving a pointer so that is very fast. when the
> block is gone and the queue is empty, a new larger block is allocated.
>
> once the queues get populated some, it will be very fast.
>
> lemme try to whip out something like this soon and see what happens. or
> does anyone else want to hack it with or without me?
>
> uri

Why aren't we just using the perl5 memory manager?  Cut-and-paste, right?
It used 2k page alignments for the most part.  That should account for
most of our alignment needs.  Actually I think it put the size as the
first several bytes.  But this could be rectified by "spilling into the
predecessor".  Namely, all allocations start at 2K but spill back 4 or so
bytes.  This means that free chains have to be <=4 above the grain.  If
you use pow2-based allocation instead of physical size-of-data, then you
can assume that it's 2^idx-4 when allocating.  It's more wasteful of
memory, but with things like 2k-4B pages and clustered sub-2K allocations,
it's very efficient.

Until we get to that point, our current hack of a memory allocator is fine
for any applications we're capable of writing with this boot-strap API.

-Michael




Re: Wow.

2001-09-24 Thread Michael Maraist

On Mon, 24 Sep 2001, Buggs wrote:

> On Monday 24 September 2001 03:27, Dan Sugalski wrote:
> > At 01:47 AM 9/24/2001 +0100, Simon Cozens wrote:
> > >http://astray.com/mandlebrot.pasm
> > >
> > >Leon, you're a sick, sick man.
> >
> > Okay, I think that means we need to weld in bitmap handling opcodes into
> > the interpreter. :)

Woohoo.. How many times have we seen code like this:
sub log2($)
{
  my $val = shift;
  my $idx = 0;
  $idx++ while $val >>=1;
  return $idx;
}

I'd love to see a full complement of bit-ops like the 386 had (oh the
days). :)  That includes get_index_of_left_most_bit (alias log2_i_i),
get_index_of_first_[one|zero] (which is useful for searching bitmaps), and
count_ones.  I'd have to go get my old [345]86 assembly book for the rest.
This is useful stuff, people. :)  Heck, on x86 platforms we could use
#ifdefs and assembly. :)  Tell me this wouldn't be cool?

-Michael




Re: Suggestion: register liveness information

2001-09-24 Thread Michael Maraist

> I have a suggestion for allowing parrot implementations to execute
> code more efficiently.  Add an "instruction" or other annotation which
> denotes what registers are "live" at some point in the code.  The

Does it have to be in the instruction stream to be useful?  Why not just
make it part of the constants / fixup segment?  This way non-JIT execution
won't require wasted op execution (which we're devilishly trying to
prematurely optimize :).  You could profile things via perl6-code attributes which
give clues to an optimizer or JIT (such as frequency of use, etc).  In any
case, I can't imagine that a JIT compiler is expected to quickly execute.
What you have suggested is summary code, which can be determined at JIT
compilation time.  I'm not convinced that the waste of CPU for invocation
of op-codes that aren't used in non-JIT form (which is the case when you
know you're only going to run this once) is justified for such simplistic
information.

However, a similar idea of "live registers" exists on newer
CPU architectures, namely mapping how many registers need to survive the
subroutine call.  The block, in turn, specifies how many it will use
within its block.  The CPU then frees non-maintained registers and
allocates just enough.  The IA64 does something along these lines.

Here, however, we're not starved for register space; we're utilizing
multiple stack streams.  The beginning of a block simply allocates new
stack space on whichever stacks it requires.  The only optimization in
this manner is to have the caller specify if it has used less than the
full complement (of 32 or so), which would allow the potential
stack-allocation to save a block of 8, 16 or 24 registers (useful when
only 8 are used.  This also rewards compiler optimizations which reuse
registers, which has cache rewards as well).  Namely:
general-op
general-op
keep_int_regs %16
jsr _foo_func
general-op

_foo_func:
push_int_regs  // only shifts reg-stack by 16 (properly handles over-flow)
push_num_regs  // shifts a full 32
general-ops
...
pop_num_regs // shifts back 32
pop_int_regs // remembered to shift back only 16
ret

Unfortunately this trades off memory savings for extra CPU overhead.  You
have to maintain a separate "shift stack" which says how much to shift
each time you do a push/pop.  The current shift value is set via the
new op-code "keep_x_regs", which also introduces overhead.

I personally would like to see smaller subroutines and more of them, but I
don't know that we're really going to be wasting a whole hell of a lot of
stack-space.  At least not enough to be worth the CPU-overhead.

In general, if the hint-information is useful for non-JIT execution as
well, then I'd recommend it, but I'm not seeing this particular point as
being worth the cost.

-Michael




Re: Using int32_t instead of IV for code

2001-09-24 Thread Michael Maraist


> >>We're talking bytecode. That will indeed be a case of "huge arrays of
> >>tightly packed integers".
> >
> >For bytecode, it's not a big problem, certainly not one I'm worried about.
> >Machines that want 64-bit ints have, likely speaking, more than enough
> >memory to handle the larger bytecode.
>
> I'm more worried about the cache. For 32 bit bytecodes, the same program
> will be only half the size than for 64 bit. Or: you can fit a twice as
> large program in the cache, or two of them (for multitasking). That will
> mean a speed-up, and likely a vast one for programs with sizes close
> enough the cache size.
>
If this is the case then why are we using 32-bit register identifiers?
Obviously it makes code easier to write.  But at some point are we going
to compress the byte-code?  Along with a previous email that suggested
that byte-code is only going to be valid on a given machine with a
pre-compiled parrot / perl6 core, the bytecode won't need to worry about
the number of registers, etc.  Most VM architectures use 8 or 16 bits for
the op-code (including the register map).

Here are the pros and cons as I see them:
Cons:
  o 8bit op-codes dramatically limits number of "macro-ops" or advanced
ops
  o 16bit op-codes have potentially screwy alignment issues.
  o 4bit register addresses have definitely screwy alignment issues
(requires masking to extract values, which takes more cpu-time)
  o sub-32-bit ops might be slower on non x86 architecture (since more and
more are 64 or 32bit only; and require special ops to munch sub-32bit
data; namely alpha)
  o If the constant-table index was chosen as 16 bits, we'd limit the number
of entries to 64K per code-segment (but who would ever use more than this
in p-code? ;)
  o If the address was limited to 16 bits, we'd have a code-size limitation
of some 4K ops per code-segment.
  o allowing p-code size to be an issue dramatically increases the number
of op-codes early on in development (write this as an enhancement
instead?).  Namely we'd have inc16_i8_ic8, inc16_i8_ic16, inc16_i8_ic32,
add16_i8_i8_ic16, add16_i8_i8_ic32.
  o larger number of op-codes means more c-code, which means a trade-off
between D-cache (for byte-code) and C-cache (for extra c-code).

Pros:
  o dramatically compressed op-code size (from 128 bits on average to
32, 48, 64).  Saves both on disk space and cache space - tighter
inner loops.
  o Potentially more highly optimized c-code; 16-bit adds are somewhat
faster on some architectures - do what's needed when it's needed.  64-bit
archs will upcast everything anyway.
  o If we eventually determined a max-code-length (of like 64bits after
alignment), then we could just make all code that big.  This would
dramatically reduce c-code over-head (no offset = DEFAULT_SIZE; offset +=
rel_address; return offset; .. code = offset;).  This would additionally
reduce the risk of jumping into the middle of an op-code.  Heck we could
do this now; simply profile all op-codes to determine the max code-size,
then pad all op-codes to that size.  Given that we're not into
dynamic-opcodes, and most everything is being pushed into the constants
area, I don't see much danger in it.

Food for thought..

-Michael




Re: Using int32_t instead of IV for code

2001-09-23 Thread Michael Maraist



> At 05:32 PM 9/23/2001 +0200, Bart Lateur wrote:
> >On Thu, 13 Sep 2001 06:27:27 +0300 [ooh I'm far behind on these lists],
> >Jarkko Hietaniemi wrote:
> >
> > >I always see this claim ("why would you use 64 bits unless you really
> > >need them big, they must be such a waste") being bandied around, without
> > >much hard numbers to support the claims.
> >
> > >Unless you are thinking of huge and/or multidimensional arrays
> > >of tightly packed integers, I don't think you should care.
> >
> >We're talking bytecode. That will indeed be a case of "huge arrays of
> >tightly packed integers".
>
> For bytecode, it's not a big problem, certainly not one I'm worried about.
> Machines that want 64-bit ints have, likely speaking, more than enough
> memory to handle the larger bytecode.
>
That's not the problem.  The problem is portability of the byte-code.
I've heard various discussions about byte-code compiling.  One of which
was that it is not considered kosher to let a program compile code and
write into a "/usr/local/scripts" directory or any such thing.  Many
companies want to be able to distribute byte-compiled code (such as
with python).

Though I don't know what the powers on high want, are we saying here and
now that the byte-code is strictly non-portable?  That parrot code is only
as good as the machine it was compiled on?  64-bit compiled parrot code
will not work on 32-bit machines.

If we're making that distinction, then I'm under the impression that a
whole heck of a lot more optimizations can be made for the byte-code than
simple word-size.  Not least of which would be in-line assembly (from
xs-code).

-Michael




Re: Using int32_t instead of IV for code

2001-09-23 Thread Michael Maraist

On Sun, 23 Sep 2001, Bart Lateur wrote:

> On Thu, 13 Sep 2001 06:27:27 +0300 [ooh I'm far behind on these lists],
> Jarkko Hietaniemi wrote:
>
> >I always see this claim ("why would you use 64 bits unless you really
> >need them big, they must be such a waste") being bandied around, without
> >much hard numbers to support the claims.
>
> >Unless you are thinking of huge and/or multidimensional arrays
> >of tightly packed integers, I don't think you should care.
>
> We're talking bytecode. That will indeed be a case of "huge arrays of
> tightly packed integers".

I was under the impression that the byte-code was all going to be
integer-based (op-code-id, reg-id, 32bit-int-constants, etc).  Anything
else should be in the constant pool.  I'm not sure how damaging 32-bit
pointers are in a 64-bit arch, but the byte-stream "addresses" are simply
32-bit offsets.  I can't imagine a 4.1-gig opcode chain.

Thus as long as we allow long long int and [long] double in the constant
block, I don't see a problem with compatibility between 32/64bit archs.
The constant-declaration should be able to specify the granularity.

-Michael




Re: Draft switch for DO_OP() :-)

2001-09-22 Thread Michael Maraist

> On Thu, Sep 20, 2001 at 11:11:42AM -0400, Dan Sugalski wrote:
> > Actually the ops=>C conversion was conceived to do exactly what's being
> > done now--to abstract out the body of the opcodes so that they could be
> > turned into a switch, or turned into generated machine code, or TIL'd. If
> > you're finding that this isn't working well it's a sign we need to change
> > things some so they will. (Better now than in six months...)
>
> The problem is that the conversion currently done by process_opcodes.pl
> translates the op definitions into functions, and leaves the remainder
> of the file untouched.  This is useful, because it allows the opcode
> file to include headers, declare file-scope variables, and the like.
> Unfortunately, when translating the ops into a switch statement in a
> header file, there is no place to put this non-opcode code.

I didn't have a problem with this.  It just took some clever perl5
programming (namely outputting to separate streams and combining them at
the very end).  It's not fool-proof, but works for the important stuff
(like includes / globals).

>
> There are a few approaches we can take.  The simplest, I think, is to
> ignore the problem when generating inline ops; given that these ops
> are going to be compiled at Perl build time (they can never be
> dynamically loaded for obvious reasons), we can manually put any
> required #includes in interpreter.c.  Files containing dynamically-
> loaded ops can be generated in the same way that process_opcodes.pl
> does now, preserving the file-scope code.
>
> Another approach would be to include a means of defining information
> that must be included by the file implementing the ops.  For example:
>
>   HEADER {
>   #include 
>   }
>
> This would then be placed into interp_guts.h.  (Possibly surrounded
> by a conditional guard (#ifdef PARROT_OP_IMPLEMENTATION), so no file
> other than interpreter.h will pick up that code.)
>
> - Damien
>

- /= |_ |_| /= /= \./





switch-based interpreter

2001-09-22 Thread Michael Maraist

I don't remember if this has already been experimented with;  I've
lost track of what the previous sets of benchmarks were for.  I was
curious to see how much faster a single-function switch based interpreter
would go.  I took process_opfunc.pl and created process_opfunc_switch.pl
which does the work.  I added a few additional optimizations such as
copying many of the interpreter variables into local variables (to
reduce the indirection by one).  I then redefined the register access
macros locally to account for this.  I also mungled "RETURN" keyword to
directly increment the code pointer.  The only problem I encountered was
that the existing parser wasn't designed to adapt well to the new "nested
code", so the comments outside the functional blocks (of
basic_opcodes.ops) are no longer near their respective code.  (Also
identified a bug in process_opfunc.pl, which I'll post patches for later.)

Using a 466MHZ dual celeron and a 600M instruction loop (bench.jako with
different limits):

Stock-code    default compiler options     8.2M ops/s
Stock-code    -O3 with the works          13.5M ops/s
switch-based  default compiler options    11.5M ops/s
switch-based  -O3 with the works          28.9M ops/s

# for reference a perl -e '...' was run with equivalent code
perl5.6 (redhat built -O2)  4.2M ops/s

I actually ran code with and without the bastardized local copies of
INT_REG, NUM_REG.  W/ default compiler options this made a difference of
several Mops/s; w/ -O3 the time difference was negligible.

One additional optimization that I'm going to try to make:  The *.ops
code is not ordered with respect to the op-code-id, so the switch
statement couldn't be optimized by gcc (it just created a local
jump-table, which essentially is the same as vtable[*code] minus the minor
overhead of subroutine stack operations, which occur in parallel).  I'm
going to redesign the compiler such that the switch statement is
numerically ordered.  I believe this will cause a direct PC + *code jump
statement.  Lastly, the highly tuned compiler decided to litter the code
with branch statements.  I'm guessing it was to maximize code reuse
(possible reduction in cache size?), though this thwarts much of the
point of having code blocks close to each other.

The end result is that the performance gain is not spectacular (at least
with this minimally cached celeron).  The test times were between 140s and
23s, with the lower ends having only tens of seconds between them.  There
was a variance of 2 or more seconds for consecutive executions, so this
isn't definitive.


I'm away from my usual internet connection, so it's inconvenient to post
patches and files.  If anyone really wants to see them, I'll make a
special effort.

-Michael




Re: A task independent of the freeze

2001-09-21 Thread Michael Maraist


> On Fri, Sep 21, 2001 at 02:24:43PM -0400, Dan Sugalski wrote:
> > Doing this by hand with -O3, you can see a speedup of around a factor of 45
> > over an unoptimised runops loop, so it's definitely worth doing in some
> > cases...
>
> Cool! Parrot JIT!
>
> But that tells us *just* how time-consuming our runops loop is.
> :(
>

I think the slowest part is the function call.  But this 45x performance
boost is probably measured via x86.  SPARC has a very nice function-call
optimization (no main-memory access needed).  Given that the inner loop
probably maxes out at 5 function calls deep (do->op->api_func->c_func..),
SPARC is optimal for parrot.

Simply going to a switch architecture gets rid of the "op-code"
function call.  The only remaining slowdown is the code branching (which
thwarts pipelining / instruction-level parallelism).

Either way, branching can be reduced by a clustering a sequence of
regularly used-ops together into macro-ops.  I'm not of the opinion that
concatenating everything into one monolithic function is of use
(converting jsr/branch to gotos), since it doesn't extend well to dynamic
operation (redefining functions / dynamically loading code).  A c-function
call isn't that much more destructive than a non-local goto, so I'd
recommend these "macro-ops" be comprised of 2 or more op-codes grouped
into a function (probably using a hash-coded function name: F124x3ba for
example).  Using hash-tables at compile-time, it would be possible to
reuse macro-ops throughout the code.  The more code-reuse (through
variable-complexity statistical code-analysis; e.g.  -O2, -O3, etc), the
better locality and cache-coherence, and the smaller the executable size.
What's more, it might even be possible to reuse these macro-ops in eval /
"use" statements if the compile-info is maintained until run-time.

The macro-ops can be used either as a free-lance collection of functions
(which are compiled and linked-in dynamically at run-time
(portability-issues?)), or as a monolithic switch-statement (of ops +
macro-ops).  Note, there is little need to group the macro-ops together
into a monolithic function, since the entirety of the main-code could
become a single macro-op (which optionally inlines the other macro-ops).

Ideally the ultimate jit would generate assembly and concatenate this into
an existing executable memory segment (self-modifying code). (yeah right)

As a plan of attack, I can see the first stage producing a monolithic
function (as with the original request), then later breaking out
function-blocks (anything in brackets, including the eventual subroutine)
into macro-ops, then just for fun (or when there's time) performing
statistic analysis to maximize code-reuse.

As a further note, macro-op code-reuse would be most efficient if we had a
manner of generalizing parameters.  This isn't a problem in stack-based
code (such as java), but we have to deal with it due to mappable
registers.  Two code segments may only differ in the specific registers
utilized (based on compiler allocation), thereby thwarting consolidation.
As a possible solution, the macro-op is provided the complete set of
register-addresses and ignores those specified by the byte-code-generator.
Namely:

Original byte-code segment:
 set I1, {constantA}
 set I2, {constantB}
 add I3, I1, I2

The macro-op _could_ be:
 F12345(code,interp):
  REG_INT(1) = {constantA};
  REG_INT(2) = {constantB};
  REG_INT(3) = REG_INT(1) + REG_INT(2)

Note how we've hard-coded the reg-addrs.  This function takes no
parameters.

My proposal is:
 F12345(code,interp)
  REG_INT(P1) = P4;
  REG_INT(P2) = P5;
  REG_INT(P3) = REG_INT(P1) + REG_INT(P2);

The above can be interpreted either as a generic function with extended
args, called as F12345(code, interp, P1, P2, P3, P4, P5) from inside
another macro-op, or via a custom-built p-code which acts in all ways
identical to our original p-code except for the dramatic expansion of
op-codes.  What's interesting about this approach is that it can use the
same interpreter for optimized code.  The main functional difference is
that we've reduced the size of the byte-code by several orders of
magnitude.  Inner loops could really fly (without any function calls or
non-local branches).
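A rough model of that parameterization (illustrative Python, not parrot code): the same macro-op body serves any register allocation, because the byte-code generator's register choices and constants arrive as arguments instead of being hard-coded:

```python
# Sketch: one generic macro-op body; register indices (p1..p3) and
# constants (p4, p5) are supplied by the caller, not baked in.
def f12345(regs, p1, p2, p3, p4, p5):
    regs[p1] = p4                      # set I<p1>, <constantA>
    regs[p2] = p5                      # set I<p2>, <constantB>
    regs[p3] = regs[p1] + regs[p2]     # add I<p3>, I<p1>, I<p2>

regs = [0] * 32
f12345(regs, 1, 2, 3, 10, 32)   # the I1/I2/I3 allocation from the example
f12345(regs, 7, 8, 9, 1, 2)     # ...or any other allocation, same body
print(regs[3], regs[9])         # -> 42 3
```

Two code segments that differ only in register numbers now consolidate into one function, which is exactly the reuse being argued for.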

The calling sequence could be as follows:
  edit foo.pl
  perl6 foo.pl
    compile foo.pl, generate pasm (output if -S option)
    assemble pasm into pbc (output if -o option)
    if (-Ox):
      generate macro-ops from pbc
      generate non-portable pbc from macro-ops
      compile macro-op code to executable code (DLL/so where available)
      output if -b option, else write to temp file
      if can-dynamically-load: load library; else: exec new code
    run interpreter on pbc

Note: you can invoke perl6 with .pl, .pasm, .pbc, or .npbc:.nplib
(non-portable byte-code and its associated library)

Food for thought...





do-loop too restrictive?

2001-09-21 Thread Michael Maraist

Question.  It seems to me that the current do-loop:
 while(code > x && code < y && *code) { DO_OP }

is fail-fast and succeed-slow.  I know it has a brother,
"while(*code)", but that one could lead to seg-faults, especially if we
allow dynamically modified parsers / compilers.

The first method has an even more significant problem, though.  Namely,
the interpreter can't possibly know how big the code segment is, since
there is the possibility of dynamic loads of p-code or of dynamically
generated "eval" code.  This is only an issue if parrot assumes a
late-bound compiler, but I'm under the impression that it's strictly
compatible with perl5, which obviously is one.

From this, an earlier bias of mine is further reinforced.  Since we don't
know the upper and lower bounds of the allowed code, and it would be nice
to avoid seg-faults, I recommend:

while(code) { DO_OP }

This is success-fast, and failure only requires executing the "end"
op-code (which executes only once for the whole program).  Checking for
the end-op as well, as with while(code && *code){}, is redundant and only
hurts mainline performance (though I know that's not yet an issue).

To maintain strict p-code checking, you'd have to dynamically load the
extents of the code-segment (most likely from the interpreter), as with:

while( code >= INTERP->start_code && code < INTERP->end_code ) { DO_OP }

jsr's and ret's might modify the start/end extents, and so on.  (Again,
*code is redundant.)
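In loop terms (illustrative Python; parrot's real loop is C), each op returns the next pc and "end" returns a false value, so the common path tests exactly one condition:

```python
# Sketch of a success-fast run loop: ordinary ops return the next pc;
# "end" returns None, which is the only thing that stops the loop.
def run(ops, pc=0):
    steps = 0
    while pc is not None:          # the single test on the hot path
        pc = ops[pc](pc)
        steps += 1
    return steps

program = [
    lambda pc: pc + 1,             # ordinary op: fall through
    lambda pc: pc + 1,
    lambda pc: None,               # "end": executes once, whole program
]
print(run(program))                # -> 3
```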

A "premature" optimization could be to have nested do-loops:

func runops( code, interp ):
  while( !interp->f_done ):
    _innerrunops( interp->start_code, interp, interp->start_code,
                  interp->start_code + interp->code_size
                  /* other run-time constants */ );
    CHECK_SYSTEM_STATE  // potentially avoids littering the code with CHECK_EVENT op-codes

func _innerrunops( code, interp, start, end /* others */ ):
  while( code >= start && code < end ) { DO_OP }

This would assume that far-jmp / far-jsr ops (those based on dynamically
linked code) set interp->code and return zero.

I'm not crazy about it; if nothing else, it requires an extra C function
call.  But it might be necessary anyway when we get to dynamic linking,
to additionally configure the interpreter state (switching current
packages, etc)
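The two-level loop might look like this (illustrative Python; names like f_done mirror the pseudocode above, not any real parrot API):

```python
# Sketch of the nested loops: the inner loop does bounds-checked dispatch;
# the outer loop handles system/event state and restarts after far-jumps.
class Interp:
    def __init__(self, ops):
        self.ops = ops
        self.start, self.end = 0, len(ops)
        self.f_done = False
        self.events_checked = 0

def inner_runops(interp, pc):
    while interp.start <= pc < interp.end:
        pc = interp.ops[pc](interp, pc)   # op returns next pc (or out of range)

def runops(interp):
    while not interp.f_done:
        inner_runops(interp, interp.start)
        interp.events_checked += 1        # CHECK_SYSTEM_STATE stand-in

def end_op(interp, pc):
    interp.f_done = True
    return -1                             # out of range: exits the inner loop

interp = Interp([lambda i, pc: pc + 1, end_op])
runops(interp)
print(interp.events_checked)              # -> 1
```

A far-jump would likewise return an out-of-range pc after updating the interpreter, bouncing control back out to the outer loop.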

I've only recently joined this mailing list, so I've missed any
previous justification for the existing use of while(*code), as well as
any direction involving dynamic linking.

-Michael





RE: question about branching/returning

2001-09-20 Thread Michael Maraist

On Thu, 20 Sep 2001, Brent Dax wrote:

> Damien Neil:
> # "RETURN(0);" (written exactly like that, no variation permitted)
> # is a special case, and terminates the runops loop.  The only op
> # which uses this is "end", and it doesn't actually ever execute.
> # Personally, I feel that this special case should be removed.
>
> We should probably just check if the thing returned 0 and exit runops if
> that's the case.  This will also protect us against utter stupidity
> like:
>
>   FOO:jump FOO
>
> which wouldn't be a bad thing to protect against.

To my understanding, jump FOO would physically map to the C code "return
code + code[1]", which means the number zero is not returned, even
though the offset of FOO is zero bytes. (I'm away from the source code so
I can't verify this, but it's what I remember.)
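Modeled as address arithmetic (illustrative Python; in the real source, code is a pointer into the byte-code): a self-jump with offset 0 returns the current address, which is non-zero, so a returned-0 check would not catch it.

```python
# Sketch: "jump" returns code + code[1]; even a self-jump (offset 0)
# returns the op's own (non-zero) address, never the number 0.
def jump_op(mem, code):
    return code + mem[code + 1]   # relative branch

mem = {100: "jump", 101: 0}       # FOO: jump FOO  (offset operand is 0)
addr = jump_op(mem, 100)
print(addr)                       # -> 100, non-zero, so a return-0 test won't fire
```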

-Michael




Re: RFC 310 (v1) Ordered bytecode

2000-09-25 Thread Michael Maraist

> Ordered bytecode
>
> Bytecode should be structured in such a way that reading and executing
> it can be parallelised.
>

Are you suggesting a threaded VM?  I know that the core is being rewritten,
so it's a possibility.  If this is the case, then you'll want to reference
some of the other RFC's relating to threading.

It's an interesting idea, and it reminds me a lot of IA64's chained lists
of non-dependent instructions.
The only way I can imagine this being useful in a non-optimized fashion is
if raw async I/O is performed.  Namely: read a minimal chunk; figure out
where the next chunk is (if it's not already cached); initiate an async
read of the newly determined chunk from disk; then begin execution /
initialization of the current chunk until its end is reached.  Inbound
async I/O (through interrupts) appends itself to the input queue.

If that model could work (unfortunately, it doesn't work on all OS's), it
would make maximal use of CPU time (especially on single CPUs), without
dealing with race conditions and the other such evils inherent in MT
programming.

In general, however, I don't see byte-code reading as being the real
bottleneck.  Take the following hypothetical charts
(compile time / byte-code read time / core execution time, plus a comment):

1 / 2 / x   # faster to compile outright.  Need to reimplement byte-code
so it does less; parallel byte-code doesn't help any.

4 / 2 / .5  # we actually get a performance boost here from the byte-code,
so it's not as critical that we shave off .4 of the total time (assuming
that's even possible).

x / 2 / 20  # compile / loading time is irrelevant compared to the whole
execution time.

4 / 2 / 2   # here is the main candidate for your parallel loader.  We
could possibly interleave execution with byte-loading, thus shaving the
total time from 4s to possibly 2.1 or 3.

One variable is whether the extreme loading time is due to excessive
disk size or to excessive computation.  If it's due to disk size, then we
have more problems to deal with, and a complete redo of the byte-code
should be done.  If it's due to computation (memory allocation, etc.),
then single-CPU implementations aren't going to gain anything.  In fact,
on most systems the additional context switching involved in MT is going
to impose more overhead than you are likely to save in the operation.
The only real winners are multi-CPU systems that are mostly idle or
otherwise dedicated to many start-and-stop operations (say, a parallel
make that heavily utilizes perl code).

My suggestion (which should probably become its own RFC) is that we not
store raw byte-code to a ".p[lm]c" file, or what-have-you, but instead
store a non-ambiguous, token-based version of the original English code.

Essentially, there'd be 2 front ends to perl: the english-eval and the
token-eval.  Everything else is the same.  The reason for this is that
the source code seems to be an order of magnitude smaller than the
compiled code, and there seems to be a great deal of compilation going on
in order to map that file to memory.

With the tokenized approach, the .pmc file would be significantly smaller
than the original .pm file, since there would be no white-space, and all
operators / local function calls would be reduced to single-byte
operations.  Algorithms and optimizations could be added to the token
file as an expanded list of attributes that modify functions / variables
(such as whether something is an int, etc).  The goal would be that the
code-analysis and optimization stages would be removed from the
compilation.
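As a toy illustration of the size argument (Python; the token table is invented for the example, not a real perl proposal):

```python
# Sketch: strip redundant whitespace and map multi-character operators /
# keywords to one-byte tokens; the cached file shrinks but stays
# source-level (unambiguous, trivially re-expandable).
TOKENS = {"my ": b"\x01", " = ": b"\x02", "print ": b"\x03", ";": b"\x04"}

def tokenize(src):
    out = src
    for text, byte in TOKENS.items():
        out = out.replace(text, byte.decode("latin-1"))
    return out.encode("latin-1")

src = "my $x = 42; print $x;"
tok = tokenize(src)
print(len(src), len(tok))   # the token stream is noticeably shorter
```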

I don't know what pascal or python use for object code, though I'm sure
it's effectively what perl's byte-code turns out to be (albeit
significantly smaller).  Java takes the whole approach of a true virtual
machine, and uses an assembly-type language accordingly.  Perl doesn't
really seem to be able to take that route, but we'll see.

It's an interesting topic, but we're not done with it yet.

-Michael








Re: RFC 301 (v1) Cache byte-compiled programs and modules

2000-09-25 Thread Michael Maraist

> So, we check for the existence of a C<.plc> file before running a
> program; if the C<.plc> file is newer than the program, we use that
> instead. If there isn't a C<.plc> file or it's older than the program,
> recompile and dump the bytecode to a C<.plc> file. Naturally, this gives
> us the best speedup for modules which change very, very infrequently,
> rather than programs which can change a lot during development. Maybe
> you only want to do this for modules, then.

I suggested this a while ago, and the response was that automatically
writing files is a security risk.
You should extend your RFC to describe a caching directory or configuration.
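The quoted mechanism is just an mtime comparison plus a best-effort write; a minimal sketch (the paths and helper are hypothetical):

```python
# Sketch of the quoted scheme: use the cached .plc only if it is newer
# than the source; otherwise recompile and (try to) rewrite the cache.
import os

def load_script(src, compile_fn, plc=None):
    plc = plc or src + "c"               # foo.pl -> foo.plc (hypothetical)
    if os.path.exists(plc) and os.path.getmtime(plc) >= os.path.getmtime(src):
        return open(plc, "rb").read()    # cache hit: skip compilation
    bytecode = compile_fn(open(src).read())
    try:
        open(plc, "wb").write(bytecode)  # best-effort, like python's .pyc
    except OSError:
        pass                             # e.g. read-only dir: just run anyway
    return bytecode
```

The OSError branch is the python behavior mentioned below: if the write fails, it goes on about its business.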

Python optionally performed the disk write - i.e., if the write failed, it
went on about its business.  The closest I could imagine to security
would be compiling to .pmc files at perl / module install time (by a
privileged user), and disallowing write access to site_perl for anyone
else.

The creation of a .plc is somewhat different (since there's no
installation process).  As a security risk, tainted code could write out
a binary file to replace the .plc file, and thus introduce a hidden virus
that would be almost undetectable.  Since the .plc file would be newer,
it would never get recompiled (until a code change).  Nor would any sort
of source-code analysis reveal the existence of the virus.  The closest
you could do would be to write a .plc checker, which regenerates the .plc
and compares it to the file on disk.

Other issues are simply disk-space considerations.  .plc files might be
large, and dozens or hundreds of scripts would consume a great amount of
disk space.  Then you have to worry about compatibility.  Copying .pl*
files from computer to computer _might_ cause problems if not implemented
correctly (namely, a perl and module version needs to be associated with
each .p[lm]c file).

In general, I'm for the idea, but I figured I'd pass on the obstacles I
encountered.

To further your RFC, I'd suggest working out a configuration for a .*c
cache directory, which could put upper limits on the disk-space usage of
compiled files.  Again, this is more of a consideration for .pl files,
which could live just about anywhere on the system.

I'd also suggest requiring that cached-module candidates have a non-zero
VERSION.

That's all I can think of at the moment.

-Michael





Re: RFC 172 (v1) Precompiled Perl scripts.

2000-08-29 Thread Michael Maraist


- Original Message -
From: "Perl6 RFC Librarian" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Tuesday, August 29, 2000 10:20 PM
Subject: RFC 172 (v1) Precompiled Perl scripts.


> This and other RFCs are available on the web at
>   http://dev.perl.org/rfc/
>
> =head1 TITLE
>
> Precompiled Perl scripts.
>
> =head1 ABSTRACT
>
> The functionality to store/load precompiled perl scripts may be built into
> the core.

You're on the right track.  May I make a suggestion?  OK, I'm obviously
biased towards Python, but Larry _did_ say that we were allowed to steal
back from other languages.

The way python works is that in the library tree, it looks for any of the
following: .py, .pyc, and .so.

You can probably guess what's what.  If only the .py is found, python
immediately writes out the .pyc when it finishes compiling.  This happens
every time (unless it can't write to the directory).

In perl, we already have .pm and .so.  They're in different directory
trees, but that's mainly because of architectural differences.  There's
no reason why a .pmc couldn't be written to the site_perl directory on
the fly.  The standard lib should probably generate its .pmc's at install
time.
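That lookup order could be sketched as follows (hypothetical extensions and helper name; not how perl actually resolves modules):

```python
# Sketch: prefer compiled artifacts when present and fresh, python-style.
import os

def resolve_module(basename):
    """Pick .so over an up-to-date .pmc over the .pm source."""
    so, pmc, pm = basename + ".so", basename + ".pmc", basename + ".pm"
    if os.path.exists(so):
        return so                     # native extension wins outright
    if (os.path.exists(pmc) and os.path.exists(pm)
            and os.path.getmtime(pmc) >= os.path.getmtime(pm)):
        return pmc                    # compiled cache is up to date
    return pm                         # compile from source (and write .pmc)
```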

I would make a separate RFC, but your title is too similar, so I'd rather
coerce you into this line of thinking.

>
> =head1 DESCRIPTION
>
> There exists a possibility to do the thing. One can store compiled perl
> code in file and load it again via B::Bytecode. But the produced file
> is large. The loading time is comparable to compiling the code anew.
> Several simple experiments show that disk i/o times does not affect too
> much - almost all data are in cache. The B extension mechanisms and
> algorithms are too slow. Therefore it is unusable.

Obviously, perl still needs some work on the op-tree storage / retrieval
mechanism before this becomes practical.

>
> Why bother? Consider the script that uses xml/xsl. It takes a lot of time
> to get such a script running. And the debugging process becomes harder
> with annoying initialization (every single .pm must be read by OS from
> numeric files and compiled from the very beginning). Another example is
> web server. Why loose CPU time to compile the same scripts?
>
> =head1 IMPLEMENTATION
>
> Perl may have an option to vary the target precompiled format. The format
> affects the generated precompiled file size. User may choose appropriate
> effective format to minimize loading time. It clearly depends on the
> system which runs the script.
>

-Michael




Re: RFC 146 (v1) Remove socket functions from core

2000-08-25 Thread Michael Maraist

> >I don't understand this desire to not want anything to change.
>
> You misread.

I sympathise.  There are definite goals and focuses that each language is
built around.  Change these too much and you have a different language,
while at the same time alienating the people who chose that language
because of those met goals.

Perl is fun, functional, and efficient.  I can artistically sculpt a
program in one line at the command prompt, do things that wield
incredible power, and feel an incredible sense of happiness.  At the same
time, I can write a massive program and not have to worry about many of
the mundane details that we learn in algorithms / data structures.

> >This is an
> >opportunity to clean up the language, make it more useable, and more fun.
> >I would have a lot more fun if perl were a better performer and if it was
> >easy for me to expand it, contract it, reshape it, improve it, etc.

> You will *not* improve the performance of the inner interpreter
> loop by having fewer opcodes to switch on.  Whether the number is
> 20 or 200, it's the same thing--*think* aboutit.

Well, that's not strictly true.  Though the raw code will not switch any
faster, the bloat of the core will hurt you with non-locality, through
cache misses or, worse, memory paging.  A highly tuned 50K core
interpreter will do much better than a 500K list of regularly used
libraries.  Essentially, it should be possible to profile suites of
programs to find out what is really most needed, and only allow that in
the core.  The rest could be delegated to slower dynamic linking or to
outer-edge statically linked code.  It's the general idea of RISC, of
course.  I have no real backing to say that this would actually improve
performance, nor that it would even be possible to streamline the core of
perl any further.  Hell, C++ is only going to make the situation worse
(performance lost to fun of development, which is acceptable).
Typically, when the core does nothing more than handle basic flow and
memory management, you can have a rock-solid kernel.  If you can build
out upon this (all op-codes seeming identical, be they built in,
statically compiled extensions, or linked in), then you will have a much
more solid development base.  This is the advantage of C, python, java,
etc.: their extensions feel just like built-ins.  Granted, perl has come
a long way with giving keyword power to user-defines.

> Furthermore,  It's
> been demonstrated that this supposed "gain", including in size, is
> negligible.  I have yet to see one iota of justification for denuding
> perl of its math functions.

I agree that the math functions should not be compromised.  If
dynamically linking them seriously decreases performance, then we're
going to see a massive slow-down in operation, since some crazed
programmers actually like performing fast Fourier transforms.  A 20-30%
slow-down is not going to be well accepted (I'm guessing at the overhead
of dynamic linking; I just hear how bad it is).  As above, however, I
don't see a problem with keeping them statically linked, so long as
they're not part of the core.

> If math functions should be removed,
> then so too string functions.  And that binmode silliness is in
> line before any of them, since it's absolutely useless because I
> never use it.

As in all things in life: moderation.  Profiling should decide what
compromises cleanliness of code for the benefit of optimization.  Many
string operations are close to memory management (array modification,
string concatenation, simple assignment, references); I would advocate
that they be part of the memory-management system, and thus part of the
core.

> Perl is *not* fun when it supplies nothing by default, the way C
> does(n't).

I hear ya.  Dynamic memory management (garbage collection, dynamic
arrays, etc), powerful string operations, powerful datatypes: these are
what attract me to perl the most, not so much readily available socket
functions.  Sure, having java.util.* built into java would make it easier
to use out of the box, but it's more to learn about the internals,
especially since they may change over time.  I can more easily walk the
lib tree and read the pod files about each library to learn what I can
do.

>
> If you want a language that can do nothing by itself, fine, but don't
> call it Perl.  Given these:
>
> * Scalar manipulation
> * Regular expressions and pattern matching

At the very least, built-in reg-exes allow the compiler to make
incredible optimizations which it simply couldn't make if regex matching
were a function call:

$str =~ s/${var} xxx \s xxx/ abs( $1 ) /xegso;

would have to be:

$reg_ex = gen_regex_replace( expr => "${var} xxx \s xxx", subs =>
"abs( $1 )", modifiers => "xegso"  );
apply_reg_ex( regex => $reg_ex, str => $str );

It would be non-trivial to get a compiler to handle this efficiently.
Thus, I don't support any removal of regular expressions.  I definitely
don't support maki

Re: RFC 146 (v1) Remove socket functions from core

2000-08-24 Thread Michael Maraist

I would actually take this sort of activity further.
I admire micro-kernel-type systems.  C and Java give you no functions out
of the box; keywords are just that, keywords.  I believe python is like
this as well.  The idea is that everything has to come from a module.
This limits how much a new developer has to learn, and it doesn't pollute
the global name-space pool.  Though I would advocate OO interfaces to
everything, I'm sure that would defeat a lot of perl's flexibility, and
so I'd support the simple Exporter method.

From this, socket and virtually all IPC methods should go the wayside.
This includes pipes, shell environment ops (the get and set functions),
and even the file I/O (open, close, and possibly even printf, sprintf).
At this point it gets a little hairy, since arguably everybody wants
print and <>.  I would suppose that most want open as well.  My personal
taste is to use IO::File, but that's just me.
Since these are just hooks into standard C libraries, I don't think they
really take up any room.  But it is hard to justify keeping some of these
less-used features while taking out others (like FORMAT).

-Michael