>
> 1) are specialized arrays the way to go? ie. do the nth & set-nth words
> generate memory access as efficient as with pointer dereferencing in C?
>

Specialized arrays can be useful because they store untagged floats, which
take up less memory than tagged floats in a normal Factor array.
Basically, a specialized array is just a length and a pointer to memory.  If
you are operating on arrays, perhaps think about doing in-place operations
(``map!`` instead of ``map``) and see if that reduces memory pressure.
Optimizations here will be similar to other languages once you understand
the implications of certain words (e.g., which ones allocate).
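
For example, a minimal listener sketch (the ``SPECIALIZED-ARRAY:`` parsing
word from the ``specialized-arrays`` vocabulary generates the
``float-array`` type and its literal syntax):

    IN: scratchpad USING: specialized-arrays sequences math ;
    IN: scratchpad SPECIALIZED-ARRAY: float
    IN: scratchpad float-array{ 1.0 2.0 3.0 } [ 2.0 * ] map!

Each element is stored as an untagged double, and ``map!`` writes the
results back into the same storage instead of allocating a new array.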

> 2) in a loop over such an array, is it correct to assume that words like
> lin-osc3 will be inlined? (when they are marked inline)
>

Inlining accomplishes a few things; one is compactness of code, but more
useful in Factor is type inference (if the input types are known in the
calling word, inlining allows those types to propagate).  Specifying types
directly (through the "typed" vocabulary, the "hints" vocabulary, the
``declare`` word, or unsafe words) makes it less necessary to mark words
inline.
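
For instance, a small sketch with the "typed" vocabulary (``norm`` is a
hypothetical word, not from your paste):

    USING: kernel math math.functions typed ;

    TYPED: norm ( x: float y: float -- n: float )
        [ sq ] bi@ + sqrt ;

The declared input types let the compiler specialize the body for floats
without the word needing to be inlined into its callers.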

If you want some help speeding things up, feel free to paste something
(even a more complex example).

The compiler allows access to various information about what's happening,
so you can see the difference between using words that specialize on types,
do bounds checking, etc.:

    IN: scratchpad USE: compiler.tree.debugger

    IN: scratchpad [ first ] optimized.
    [ 0 swap nth ]

    IN: scratchpad [ { array } declare first ] optimized.
    [
        0 swap 2dup 1 slot fixnum< [ \ t ] [ f ] if
        [ nip ] [ bounds-error ] if >R R> 2 slot
    ]

    IN: scratchpad [ { array } declare first-unsafe ] optimized.
    [ >R R> 2 slot ]

And you can ``disassemble`` to see the machine code; that last one is
pretty efficient:

    IN: scratchpad [ { array } declare first-unsafe ] disassemble
    000000010fe79d20: 8905da7218ff  mov [rip-0xe78d26], eax
    000000010fe79d26: 498b0e        mov rcx, [r14]
    000000010fe79d29: 488b490e      mov rcx, [rcx+0xe]
    000000010fe79d2d: 49890e        mov [r14], rcx
    000000010fe79d30: 8905ca7218ff  mov [rip-0xe78d36], eax
    000000010fe79d36: c3            ret
    000000010fe79d37: 0000          add [rax], al
    000000010fe79d39: 0000          add [rax], al
    000000010fe79d3b: 0000          add [rax], al
    000000010fe79d3d: 0000          add [rax], al
    000000010fe79d3f: 00            invalid

>
> On Fri, Feb 6, 2015 at 4:53 PM, John Benediktsson <mrj...@gmail.com>
> wrote:
>
>> Some thoughts for you:
>>
>> No, ``dup`` does not do anything but duplicate essentially a pointer to
>> the object.
>>
>> Part of the reason it is slow is that you are keeping your { x y } pairs
>> boxed in arrays (and in some cases unboxing with ``first2`` and re-boxing
>> with ``2array``).  Each of the "math" words (``v*n``, ``v+``, ``v/n``,
>> etc.) does the same.  They aren't in-place operations, so they always
>> allocate memory.
>>
>> In addition to ``time``, you can also ``profile`` your program:
>>
>>     IN: scratchpad gc [ bench1 ] profile
>>     Running time: 2.656065607 seconds
>>
>>     IN: scratchpad flat profile.
>>     depth   time ms  GC %  JIT %  FFI %   FT %
>>        0    2657.0   5.91   0.00  17.69   0.00 T{ thread f "Listener"
>> ~curry~ ~quotation~ 39 ~box~ f t f H{ } f...
>>        0    2656.0   5.87   0.00  17.66   0.00   bench1
>>        0    2655.0   5.88   0.00  17.66   0.00   step
>>        0     433.0  18.01   0.00  18.24   0.00   *
>>        0     430.0  18.14   0.00  90.70   0.00   <array>
>>        0     363.0   0.00   0.00   0.00   0.00   M\ array nth-unsafe
>>        0     275.0   0.00   0.00   0.00   0.00   /
>>        0     232.0   0.00   0.00   0.00   0.00   <
>>        0     194.0   0.00   0.00   0.00   0.00   +
>>        0     178.0   0.00   0.00   0.00   0.00   M\ array length
>>        0     141.0   0.00   0.00   0.00   0.00   M\ array set-nth-unsafe
>>        0     140.0   0.00   0.00   0.00   0.00   M\ sequence nth
>>        0     113.0   0.00   0.00   0.00   0.00   M\ integer bounds-check?
>>        0     104.0   0.00   0.00   0.00   0.00   M\ fixnum integer>fixnum
>>
>> Here's a couple ideas for speeding it up.
>>
>> You can "inline" all the math, so that you operate on ``x`` and ``y``,
>> not ``{ x y }``, avoiding all the array accesses and mallocs.
>>
>>     : lin-osc2 ( x y -- x1 y1 )
>>         2dup                     ! x y x y
>>         swap                     ! x y y x
>>         -1.0 *                   ! x y y -x
>>         [ 0.01 * ] bi@           ! x y dx*dt dy*dt
>>         [ + ] bi-curry@ bi*      ! x1 y1
>>         2dup [ absq ] bi@ + sqrt ! x1 y1 norm
>>         [ / ] curry bi@ ; inline
>>
>>     : bench2 ( -- x y )
>>         1.0 0.0 10,000,000 [ lin-osc2 ] times ;
>>
>> Note: you have to return something in "bench" or too much gets optimized
>> away.
>>
>>     IN: scratchpad gc [ bench2 ] time
>>     Running time: 0.213926035 seconds
>>
>> Sometimes it might be easier for you to see the flow of math with locals
>> (doesn't affect performance):
>>
>>     :: lin-osc3 ( x y -- x' y' )
>>         0.01 :> dt
>>         y :> dx
>>         x neg :> dy
>>
>>         x dx dt * + :> x1
>>         y dy dt * + :> y1
>>
>>         x1 y1 [ absq ] bi@ + sqrt :> norm
>>
>>         x1 norm /
>>         y1 norm / ; inline
>>
>>     : bench3 ( -- {x,y} )
>>         1.0 0.0 10,000,000 [ lin-osc3 ] times 2array ;
>>
>> You can look into using things like the "typed" vocabulary, although
>> because of the way we are inlining the inputs above, it should already
>> "know" that it is operating on floats.
>>
>> The "typed" vocabulary checks inputs against known types.  If you don't
>> want to slow down to do that, you can just declare your types (it's unsafe
>> and in a private vocabulary because if you declare the wrong type you can
>> hard crash, and we don't want it to be misused):
>>
>>     { float float } declare
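>>
>> For example (``lin-osc-fast`` is a hypothetical name; ``declare`` lives
>> in the private ``kernel.private`` vocabulary, and passing anything other
>> than two floats here is undefined behavior):
>>
>>     USING: kernel.private ;
>>
>>     : lin-osc-fast ( x y -- x' y' )
>>         { float float } declare lin-osc2 ;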
>>
>> Anyway, hope that helps, and sorry for the spam on the paste site,
>> reCAPTCHA isn't what it used to be for keeping away the bad robots.
>>
>> Best,
>> John.
>>
>>
>> On Fri, Feb 6, 2015 at 4:07 AM, Marmaduke Woodman <mmwood...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I've attempted to write a perhaps naive ODE integration loop in Factor,
>>>
>>>   http://paste.factorcode.org/paste?id=3428
>>>
>>> but it seems quite slow: the `bench1` word reports running time of ~3 s,
>>> which is an order of magnitude off equivalent OCaml & Haskell, so I imagine
>>> due to my lack of Factor experience there's boxing, unboxing and mixing of
>>> types leading to poor performance.
>>>
>>> Are there some general principles for writing performant numerical code
>>> in Factor? Do the generic sequence words get optimized or explicit use of
>>> unsafe words are required?
>>>
>>> A specific question about `dup` on container types: are the underlying
>>> data duplicated?
>>>
>>> If I've missed any potential reading material on this, refs would be
>>> much appreciated.
>>>
>>> Cheers,
>>> Marmaduke
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Dive into the World of Parallel Programming. The Go Parallel Website,
>>> sponsored by Intel and developed in partnership with Slashdot Media, is
>>> your
>>> hub for all things parallel software development, from weekly thought
>>> leadership blogs to news, videos, case studies, tutorials and more. Take
>>> a
>>> look and join the conversation now. http://goparallel.sourceforge.net/
>>> _______________________________________________
>>> Factor-talk mailing list
>>> Factor-talk@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/factor-talk
>>>
>>>
>>
>>
>>
>>
>
>
>
>
