Hi Slava,
Thanks very much for your mail and for taking a look at this - I'm
looking forward to investigating the new optimizer tonight.
(N.B. I wrote the rest of this mail last night so it doesn't take into
account your last email.)
----
As far as I can see GC doesn't get called within a basic block, so as an
experiment I re-implemented count-items using some custom intrinsics
words which effectively unbox the variables involved:
untag-fixnum
unbox-alien
fetch32-unboxed
+4-unboxed
I also used words with existing intrinsics: eq?
I was able to get down to 338ms for count-items after these mods. For
reference the optimized dataflow is:
( scratchpad ) \ count-items f optimized-word.
[
untag-fixnum 0 swap
( #shuffle: @4928410 0 @4928406 -- 0 @4928410 @4928406 )
>r mmap>from,to
>r unbox-alien r>
unbox-alien r>
\ ( gensym ) [
pick pick fixnum< [
swap
>r 2dup
>r
>r swap fetch32 eq? 1 0 rot [ drop ] [ nip ] if
fixnum+fast r>
r>
r>
swap
>r
>r +4-unboxed r>
r>
( gensym )
] [ 3drop ] if
] call
]
To get an idea of how the times break down:
Empty loop with no fetching: 99ms
( i.e.looping 12495752 times - same speed as 12495752 [ drop ] each )
Empty loop fetching each unsigned-int with fetch32-unboxed: 214ms
(i.e. looping and fetching 12495752 times)
Full count-items: 338ms
(all these on 32bit x86 with the processor limited to 800mhz, working on
a 48MB file). If anybody is interested in the code I can post it up
somewhere.
Cheers,
Phil
Slava Pestov wrote:
> Hi Phil,
>
> I found the time to look at your code and thought about what's going on here.
>
> Here are the main sources of overhead, as far as I can tell:
>
> - The accumulating integer for the count cannot be proven to be
> fixnum. This would be the case even if the loop count was known to be
> a fixnum (which it isn't, in your case). Consider something like this:
>
> { fixnum } declare 0 swap [ 1+ ] times
>
> Even though the counter increments in lock-step with the internal
> counter of 'times', which is known to be bounded by a fixnum, the
> optimizer doesn't notice this and produces the following code:
>
> [
> 0 swap 0 swap \ ( gensym ) [
> over over fixnum< [
> >r
> >r 1 +-integer-fixnum r>
> r>
> >r 1 fixnum+fast r>
> ( gensym )
> ] [ 2drop ] if
> ] label
> ]
>
> 1 +-integer-fixnum involves a subroutine call and two conditional
> branches and is not as fast as 1 fixnum+fast which is essentially a
> single instruction.
>
> A similar situation occurs with a 'count' combinator. The counter is
> always less than or equal to the loop induction variable, so if the
> latter is a fixnum the counter is too. The optimizer needs to realize
> this fact.
>
> While the required induction variable analysis might not appear for a
> while, the new code generator that's planned will make
> +-integer-fixnum more efficient.
>
> - The other problem is that on 32-bit platforms, there is no inline
> intrinsic for alien-unsigned-4 and so a call into the VM is generated,
> which is very expensive.
>
> - Even if the code generator could produce inline assembly for
> alien-unsigned-4, it wouldn't in your above code, because it cannot
> assume that address>> outputs a simple-alien and length>> outputs a
> fixnum.
>
> So the bad news is that for this given problem, it won't get much
> faster than it is now until I work on the compiler some more.
>
> The good news is that thanks to the new optimizer's tuple unboxing,
> ity is possible to implement an algorithm with almost identical
> performance but without re-implementing parts of sequences or using
> unsafe words:
>
> TUPLE: int-array { alien read-only } { length read-only } ;
>
> : <int-array> ( alien bytes -- int-array )
> "int" heap-size /i int-array boa ; inline
>
> M: int-array length length>> ;
>
> M: int-array nth-unsafe alien>> int-nth ;
>
> INSTANCE: int-array immutable-sequence
>
> : count-items ( alien bytes item -- n )
> [ <int-array> ] dip [ = ] curry count ;
>
> HINTS: count-items simple-alien fixnum fixnum ;
>
> : with-whole-mapped-file ( fname quot -- )
> [ dup file-info size>> ] dip with-mapped-file ; inline
>
> : test-count-items ( fname val -- count )
> [
> [ [ address>> ] [ length>> ] bi ] dip count-items
> ] curry with-whole-mapped-file ;
>
> The key here is that the <int-array> gets unboxed because its
> allocation never escapes, and so we end up with essentially the same
> code that you started with.
>
> On 64-bit platforms, the alien-signed-4 call (in the int-nth inline
> word) should be open-coded as an assembler intrinsic, because we
> explicitly declared that the alien is a simple-alien and that the
> length is a fixnum in hints, and this would improve performance.
>
> Note that we didn't have to declare types on the tuple slots; they
> propagate from the HINTS: through the constructor and out of the
> accessors. The read-only declaration is essential though, this is
> required in order for tuple slot types to propagate and for unboxing
> to take place.
>
> Finally, I added a new benchmark, extra/benchmark/dawes (I hope you
> don't mind the name ;-), as a reminder to myself that I need to work
> on this in the future.
>
> Slava
>
> -------------------------------------------------------------------------
> This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
> Build the coolest Linux based applications with Moblin SDK & win great prizes
> Grand prize is a trip for two to an Open Source event anywhere in the world
> http://moblin-contest.org/redirect.php?banner_id=100&url=/
> _______________________________________________
> Factor-talk mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/factor-talk
>
-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
Factor-talk mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/factor-talk