"Gregor N. Purdy" wrote:

> Michael --
>
> > I had more time to think about it, and I determined how a compute op-code
> > could be efficient.
> >
> > [snip]
>
> You wicked, wicked person! :)
>
> I'd like to see some benchmarks on that one vs. the most efficient
> possible hand-coded separate ops for moderate to complex arithmetic...
> These sorts of ops should be possible without them being in the core
> once we have loadable oplibs, as long as the assembler knows how to
> lay them out...
>
> BTW, the ops should probably be:
>
>   * compute_i_s_v
>   * compute_i_sc_v
>   * compute_n_s_v
>   * compute_n_sc_v
>
> > /** FMT: compute v_arg_size, destination-register, fmt_const_string,
> > ( list of reg-indexes )
>
> The nargs operand is spliced in in place of the 'v'
>
> Regards,

I didn't have the var-args available so I wrote a hack: functions with 4, 8
and 16 register-parameters (from the above) which were just aliases of the
same function which didn't care (because it was based on the format field).  I
wrote a dozen or so variations of the function 'calculate'.  I determined
after a lot of experimenting that there is not much value in providing a
general stack machine sub-op-code-engine (with push, pop, dup, and swap).  The
performance trails off as the number of symbols increases (since 50% of the
operations are push).  My attempts at combining multiple chars in a switch
statement had marginally positive effects when in optimial conditions, but
performed slower in sub-optimal conditions.

I later wrote two variations. One was a semi-simple divde-and-conquor summer:

/* sum Pdest, arg-count, arg1, arg2, ... */
AUTO_OP sum_i_i_i_i_i_i {
  unsigned int cnt = P2 - 1;
  opcode_t* args_offset = cur_opcode + 3;

  int accum = INT_REG( *args_offset++ );

  while(cnt) {
    if ( cnt >= 8 ) {
      accum += INT_REG( *args_offset++ ) +
       INT_REG( *args_offset++ ) +
       INT_REG( *args_offset++ ) +
       INT_REG( *args_offset++ ) +
       INT_REG( *args_offset++ ) +
       INT_REG( *args_offset++ ) +
       INT_REG( *args_offset++ ) +
       INT_REG( *args_offset++ );
       cnt -= 8;
    } else if ( cnt >= 4 ) {
      accum += INT_REG( *args_offset++ ) +
       INT_REG( *args_offset++ ) +
       INT_REG( *args_offset++ ) +
       INT_REG( *args_offset++ );
       cnt -= 4;
    } else if ( cnt >= 2 ) {
      accum += INT_REG( *args_offset++ ) +
       INT_REG( *args_offset++ );
       cnt -= 2;
    } else {
      accum += INT_REG( *args_offset++ );
       cnt--;
    }
  }
  INT_REG( P1 ) = accum;
}

This performed the best in all tests.  Note that I only went down to n=4;
obviously there would be a performance hit for anything less.  Unfortuntaely
without a format, it's impossible to mix-and-match constants with registers.
Or even integers with numbers.  You'd need a separate sum_int and sum_num.

The other function is a general accumulator-based function which uses a format
to figure out not only the type of operation (add, sub, mul, div,
mod,and,or,xor, land,lor,lxor,not), but also the format (int-reg, int-const,
num-reg, num-const).  By using a single character, performance is maintained
(though readibility suffers somewhat).  What follows is mostly what I used for
my tests.


AUTO_OP compute_i_i_s_i_i_i_i { // note should be I_i_s_v
   STRING *s = Parrot_string_constants[P3];

   opcode_t* args_offset = cur_opcode + 4;

   char * ops = s->bufstart;

   /* do bounds checking */

   int accum = INT_REG( *args_offset++ ); // probably not always what we want,
but costs too much otherwise

   while(*ops) {
     switch( *ops ) {
       case '+':
         accum += INT_REG( *args_offset++ );
         break;
       case '-':
         accum -= INT_REG( *args_offset++ );
         break;
       case '*':
         accum *= INT_REG( *args_offset++ );
         break;
       case '/':
         accum /= INT_REG( *args_offset++ );
         break;
       case '%':
         accum %= INT_REG( *args_offset++ );
         break;

       case 'a':
         accum += *args_offset++;
         break;
       case 's':
         accum -= *args_offset++;
         break;
       case 'm':
         accum *= *args_offset++;
          break;
       case 'd':
         accum /= *args_offset++;
         break;

       case 'A':
         accum += NUM_REG( *args_offset++ );
         break;
       case 'S':
         accum -= NUM_REG( *args_offset++ );
         break;
       case 'M':
         accum *= NUM_REG( *args_offset++ );
         break;
       case 'D':
         accum /= NUM_REG( *args_offset++ );
         break;

       default:
         printf("Unsupported calculate-opcode %c. Exiting\n", *ops);
         exit(-1);
     }
     ops++;
   }
   INT_REG( P2 ) = accum;
}

A summary of the results for 6 million iterations of each of the following on
a 466MHZ celeron on Linux:

// for n = 4    |    for n = 16
sum:    21 seconds  (18% faster)  |    23 seconds ( 78% faster)
format-calculated:    22s  (9% faster)  |  39s ( 47% faster)
manual-adds:    23s    |    63s
null-loop:     12s   // for reference

Note that the %-faster is calculated after subtracting out the null-loop.
Additionally for n=4, the times are too close to the varience to draw any
conclusions.

// pseudo op-codes (assuming auto v_args in assembler)

manual ops:
ADD I1 = I2 + I3
ADD I1 = I1 + I4
ADD I1 = I1 + I5
...

sum op:
sum I1 = I2, I3, I4, I5, ...

calculate op:
calculate I1 = "++++...", I2, I3, I4, I5, ...


Variations can only be of slower operations (such as div), which will tend to
normalize the performance as more and more CPU time goes for the actual
assembly operations (making gains less dramatic)

End result:  I'd say that unless we profile code and find a good deal of
algebra with 8+ terms, I'm not seeing a great value for this stuff.  This
would only be useful if treated especially by the compiler, and since var-args
are being shunned in the core, I suppose we should abondon this sort of
optimization.

-Michael

Reply via email to