Re: [Factor-talk] asm intrinsics for speed

Phil Dawes Wed, 03 Sep 2008 07:31:31 -0700

Hi Slava,

Thanks very much for your mail and for taking a look at this - I'm 
looking forward to investigating the new optimizer tonight.


(N.B. I wrote the rest of this mail last night so it doesn't take into 
account your last email.)
----

As far as I can see GC doesn't get called within a basic block, so as an 
experiment I re-implemented count-items using some custom intrinsics 
words which effectively unbox the variables involved:

untag-fixnum
unbox-alien
fetch32-unboxed
+4-unboxed

I also used words with existing intrinsics: eq?

I was able to get down to 338ms for count-items after these mods. For 
reference the optimized dataflow is:

( scratchpad ) \ count-items f optimized-word.
[
     untag-fixnum 0 swap
     ( #shuffle: @4928410 0 @4928406 -- 0 @4928410 @4928406 )
     >r mmap>from,to
     >r unbox-alien r>
     unbox-alien r>
     \ ( gensym ) [
         pick pick fixnum< [
             swap
             >r 2dup
             >r
             >r swap fetch32 eq? 1 0 rot [ drop ] [ nip ] if
             fixnum+fast r>
             r>
             r>
             swap
             >r
             >r +4-unboxed r>
             r>
             ( gensym )
         ] [ 3drop ] if
     ] call
]

To get an idea of how the times break down:

Empty loop with no fetching: 99ms
   ( i.e.looping 12495752 times - same speed as 12495752 [ drop ] each )

Empty loop fetching each unsigned-int with fetch32-unboxed: 214ms
  (i.e. looping and fetching 12495752 times)

Full count-items: 338ms

(all these on 32bit x86 with the processor limited to 800mhz, working on 
a 48MB file). If anybody is interested in the code I can post it up 
somewhere.

Cheers,

Phil


Slava Pestov wrote:
> Hi Phil,
> 
> I found the time to look at your code and thought about what's going on here.
> 
> Here are the main sources of overhead, as far as I can tell:
> 
> - The accumulating integer for the count cannot be proven to be
> fixnum. This would be the case even if the loop count was known to be
> a fixnum (which it isn't, in your case). Consider something like this:
> 
> { fixnum } declare 0 swap [ 1+ ] times
> 
> Even though the counter increments in lock-step with the internal
> counter of 'times', which is known to be bounded by a fixnum, the
> optimizer doesn't notice this and produces the following code:
> 
> [
>     0 swap 0 swap \ ( gensym ) [
>         over over fixnum< [
>             >r
>             >r 1 +-integer-fixnum r>
>             r>
>             >r 1 fixnum+fast r>
>             ( gensym )
>         ] [ 2drop ] if
>     ] label
> ]
> 
> 1 +-integer-fixnum involves a subroutine call and two conditional
> branches and is not as fast as 1 fixnum+fast which is essentially a
> single instruction.
> 
> A similar situation occurs with a 'count' combinator. The counter is
> always less than or equal to the loop induction variable, so if the
> latter is a fixnum the counter is too. The optimizer needs to realize
> this fact.
> 
> While the required induction variable analysis might not appear for a
> while, the new code generator that's planned will make
> +-integer-fixnum more efficient.
> 
> - The other problem is that on 32-bit platforms, there is no inline
> intrinsic for alien-unsigned-4 and so a call into the VM is generated,
> which is very expensive.
> 
> - Even if the code generator could produce inline assembly for
> alien-unsigned-4, it wouldn't in your above code, because it cannot
> assume that address>> outputs a simple-alien and length>> outputs a
> fixnum.
> 
> So the bad news is that for this given problem, it won't get much
> faster than it is now until I work on the compiler some more.
> 
> The good news is that thanks to the new optimizer's tuple unboxing,
> ity is possible to implement an algorithm with almost identical
> performance but without re-implementing parts of sequences or using
> unsafe words:
> 
> TUPLE: int-array { alien read-only } { length read-only } ;
> 
> : <int-array> ( alien bytes -- int-array )
>     "int" heap-size /i int-array boa ; inline
> 
> M: int-array length length>> ;
> 
> M: int-array nth-unsafe alien>> int-nth ;
> 
> INSTANCE: int-array immutable-sequence
> 
> : count-items ( alien bytes item -- n )
>     [ <int-array> ] dip [ = ] curry count ;
> 
> HINTS: count-items simple-alien fixnum fixnum ;
> 
> : with-whole-mapped-file ( fname quot -- )
>     [ dup file-info size>> ] dip with-mapped-file ; inline
> 
> : test-count-items ( fname val -- count )
>     [
>         [ [ address>> ] [ length>> ] bi ] dip count-items
>     ] curry with-whole-mapped-file ;
> 
> The key here is that the <int-array> gets unboxed because its
> allocation never escapes, and so we end up with essentially the same
> code that you started with.
> 
> On 64-bit platforms, the alien-signed-4 call (in the int-nth inline
> word) should be open-coded as an assembler intrinsic, because we
> explicitly declared that the alien is a simple-alien and that the
> length is a fixnum in hints, and this would improve performance.
> 
> Note that we didn't have to declare types on the tuple slots; they
> propagate from the HINTS: through the constructor and out of the
> accessors. The read-only declaration is essential though, this is
> required in order for tuple slot types to propagate and for unboxing
> to take place.
> 
> Finally, I added a new benchmark, extra/benchmark/dawes (I hope you
> don't mind the name ;-), as a reminder to myself that I need to work
> on this in the future.
> 
> Slava
> 
> -------------------------------------------------------------------------
> This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
> Build the coolest Linux based applications with Moblin SDK & win great prizes
> Grand prize is a trip for two to an Open Source event anywhere in the world
> http://moblin-contest.org/redirect.php?banner_id=100&url=/
> _______________________________________________
> Factor-talk mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/factor-talk
> 



-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
Factor-talk mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/factor-talk

Re: [Factor-talk] asm intrinsics for speed

Reply via email to