I haven't had enough time to polish a paper about the subject, so I decided to 
post my results here, in Haskell Café.

When Simon Peyton-Jones was in Moscow about a month ago I made a bold statement 
that Parallel Haskell programs, expressed with the help of par and pseq, can be 
transformed into Nested Data Parallel Haskell.

I wrote a simple model of what will be if we group arguments of par and pseq 
into queues based on parallel arrays from NDPH. We could then evaluate all 
thunks from that queues in parallel using SIMD (or just getting higher ILP). We 
also do not have to lock out main spark queue, we lock only queue we should put 
argument in. The latter turned out to be beneficial in itself, at least in my 
simple model.

Let us look at famous parallel fib function:
    Fib n
        | N <= 1 = 1
        | otherwise = a `par` b `pseq` a+b
        where
            a = fib (n-1)
            b = fib (n-2)

When we evaluate fib we put a spark of a into spark queue and proceed to 
evaluate b and, then, a+b. When we put a into spark queue, we lock queue out, 
modify and release. There is a big probability that threads on different cores 
will compete for queue lock and some of them (most of them) will waste time 
waiting for lock release.

If we group a's into different queue and put to main spark queue a spark to 
evaluate a complete group of a's at once, we will get less wasted time.

This will work even for single CPU. Below is a run of (fib 15) on my model with 
cpuCount 1, 4 and 16 and a's or b's group length 0 (no grouping) and 16:

    cpuCount  groupLength=0  groupLength=16
              modelTicks     modelTicks
    1         34535          27189
    4         12178          7472
    16        7568           3157
    speedup   2.19 times     3.11 times
    ticks1/ticks16

I think results speak for itself. I think the idea of `par` argument grouping 
could be viable.

I should note that I made several digressions when I wrote model. One of 
digressions is that all evaluations are put into queue. In (a `par` b `pseq` 
a+b) a put into a_queue, b put into b_queue and (a+b) put into main queue. Each 
evaluated spark update it's "parent" - a spark that wait for it.

Also, main loop of single CPU changed from simple (reading main spark queue + 
execute when get something) into a series of attepmts with fall back on failure:
- first read main queue and execute spark if succeed,
- else read current a_queue and execute all sparks there if succeed,
- else read current b_queue and execute all sparks there if succeed,
- else go to main loop.

The new (transformed) code for our fib below:

    -- |Create a new queue based on parallel array. It holds a parallel array 
with current arguments and a function that performs
    -- computation in RTS monad (evaluation function).
    newQueueParArray :: (x -> RTS ()) -> RTSRef ([: x :],x -> RTS ())
    a_queue = unsafePerformIO $ newQueueParArray (\x -> fib (x-1)) -- RTSRef 
(Int,Int -> Int)
    b_queue = unsafePerformIO $ newQueueParArray (\x -> fib (x-2)) -- RTSRef 
(Int,Int -> Int)

    fib n caller
        | n <= 1 = 1
        | otherwise = unsafePerformIO $ do
            Ab <- addToMainQueue (defer (+) caller)
            A' <- addToParArrQueue a_queue x ab
            B' <- addToParArrQueue b_queue x ab
            addToMainQueue ab -- add a spark to check a' and b' evaluation 
status, compute a+b and update the caller.

That transfomation cannot be done at the source level using usual type 
(class/families) hackery. It could be done, though, using core-to-core 
transformations.

It is clear that several values of same type (Int for fib) and a function to 
perform operations over them leads to SIMD execution.

I made some provisions to exploit SIMD and ILP in my model. It can load more 
than single task information and add several values per cycle.

This also speeds execution up, but not so radically (about 3%).

The source code for model is here: 
http://82.146.47.211/attachment/wiki/ndpph/ndp-ph.hs (it's a MskHUG.ru domain, 
we have temporary problems with DNS). It short of comments, but I tried to make 
understandable function names.

Compile it with 'ghc -o ndp-ph --make -O2 ndp-ph' and run with a command line 
like the following:

     ndp-ph.exe cpuCount 1 usePrivateGroups 0 maxABTaskLength 64 
taskFetchsPerCycle 1 thunksPerAddition 8 cyclesPerAddition 1

I decided to create a model of exexcution instead of modifying an existing 
implementation because I have not enough time. I wrote it over evenings and a 
weekend, so it is simple, it's rude and it does the job pretty fine.
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Reply via email to