Advice wanted on parallel processing

2009-03-18 Thread Colin Paul Adams
I've just managed to build ghc 6.11 (Thanks Simon).

I did this for two reasons, one of which is I want to try to improve
the speed of the AI for the Chu Shogi program I am writing by making
use of parallel processing. I have a 4-core Xeon running Fedora Linux
10 (AMD64).

I have a repeatable scenario for timings. With ghc 6.10.1, no threaded
runtime, the computer takes 51 or 52 seconds to move (experimental
variation up to 1.2 seconds).

With ghc 6.11, it takes the same time.

If I switch on the threaded runtime, then it rises slightly (perhaps
not significantly - I measure 53-54 seconds). With 2, 3 or 4
processors, I measured (one-off, so not reliable) 65, 55 and 58
seconds respectively.

Well, that doesn't really matter, provided I can get the time back
with interest by exploiting parallelism. My first thought for an
"easy" win is the move generator. At present I have this code:

-- | Generate all legal moves from state.
--
-- The best moves (according to some heuristic) should come first.
generate_moves :: (Non_interactive_state, Maybe Move)
               -> [(Non_interactive_state, Maybe Move)]
generate_moves (state, _) =
    let colour = if is_black_to_play state then Black else White
        pieces' = all_pieces_of_colour (board state) colour
        unsorted = concatMap (generate_moves_for_piece state) pieces'
        sorted = sortBy best_move unsorted
        moves = mapMaybe new_move sorted
    in {-trace (concat (intersperse "\n"
           (map (print_move . fromJust . snd) moves)))-} moves
  where
    new_move mv =
        case update_from_move (state, Nothing) mv of
            Nothing -> Nothing
            Just (st', Nothing) -> Just (st', Just mv)


and my idea was to run the call to generate_moves_for_piece in
parallel over the pieces' list. I thought I could do that simply by
replacing concatMap with parFlatMap, but discovered I need to supply
an extra argument - a Strategy. Not knowing what this should be
(where's the best place to read up on this?) I tried specifying rwhnf.
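For readers following along, the substitution described here looks roughly like this. This is a sketch, not the program's actual code: `expand` is a toy stand-in for generate_moves_for_piece, and parFlatMap/rwhnf come from Control.Parallel.Strategies (in today's parallel package rwhnf survives as a deprecated alias for rseq):

```haskell
import Control.Parallel.Strategies (parFlatMap, rwhnf)

-- Toy stand-in for generate_moves_for_piece: one input expands
-- into several candidate outputs.
expand :: Int -> [Int]
expand n = [n * 10 + k | k <- [0 .. 2]]

-- parFlatMap strat f xs sparks the evaluation of each sublist (f x)
-- in parallel, applying strat to it, then concatenates the results.
-- With rwhnf each spark only evaluates its sublist to head normal form.
candidates :: [Int]
candidates = parFlatMap rwhnf expand [1 .. 4]
```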

My timings then are:

-N1 83 seconds
-N2 57 seconds
-N3 58 seconds
-N4 60 seconds

so a complete failure.

Where should I go from here?
-- 
Colin Adams
Preston Lancashire
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Advice wanted on parallel processing

2009-03-18 Thread Daniel Fischer
On Wednesday, 18 March 2009 at 15:28, Colin Paul Adams wrote:
> [...]
> and my idea was to run the call to generate_moves_for_piece in
> parallel over the pieces' list. I thought I could do that simply by
> replacing concatMap with parFlatMap, but discovered I need to supply
> an extra argument - a Strategy. Not knowing what this should be
> (where's the best place to read up on this?) I tried specifying rwhnf.

Control.Parallel.Strategies, docs and source?

generate_moves_for_piece produces a list. rwhnf forces this list only far 
enough to see whether it's [] or (_:_) (rwhnf x = x `seq` ()), and that 
doesn't get enough work done in each spark to compensate for the overhead. 
Try using rnf after writing an NFData instance for Move.

If, for example,

data Move = Move {from :: Position, to :: Position}

then the instance would be

instance NFData Move where
    rnf (Move f t) = rnf f `seq` rnf t `seq` ()

That might require NFData instances for Position and its components, but 
writing these should be mechanical.
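Spelling that out for an assumed Position type (the real definition in Colin's program may differ; in the Strategies library of that era NFData lived in Control.Parallel.Strategies, whereas today it is in Control.DeepSeq):

```haskell
import Control.DeepSeq (NFData(..))

-- Assumed shape for illustration only; the program's real Position
-- may be defined differently.
data Position = Position Int Int

-- Force both fields; Int already has an NFData instance, so rnf on
-- each field fully evaluates it.
instance NFData Position where
    rnf (Position file rank) = rnf file `seq` rnf rank
```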

> [...]

HTH,
Daniel



Re: Advice wanted on parallel processing

2009-03-18 Thread Colin Paul Adams
> "Daniel" == Daniel Fischer  writes:

Daniel> generate_moves_for_piece produces a list. rwhnf forces
Daniel> this list enough to see if it's [] or (_:_) (rwhnf x = x
Daniel> `seq` ()), that doesn't get enough work done in each
Daniel> thread to compensate the overhead. Try using rnf after
Daniel> writing an NFData instance for Move.

Daniel> If e.g.

Daniel> data Move = Move {from :: Position, to :: Position}

Daniel> , the instance would be

Daniel> instance NFData Move where
Daniel>     rnf (Move f t) = rnf f `seq` rnf t `seq` ()

It made no difference to the timings whatsoever.

Anyway, at that point I decided to revert to the first rule of
performance tuning, and profiled the program for time. The move
generator was using less than 3% of the cpu time, so that was clearly
a waste of time on my part.

The one routine that stood out was this one (about 35% CPU time, with
0% attributed to children):


-- | Value of one rank of the board.
rank_value :: (Int, Array Int Square) -> Int
rank_value (rank_coord, rank') =
    sum (map (cell_value rank_coord) (assocs rank'))

The array has 12 elements.

So I tried changing the map to parMap rnf. This gave timings of:

-N1 93 seconds
-N2 105 seconds
-N3 206 seconds

at which point I decided I hadn't got a clue what was going on.
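For reference, the attempted change amounts to something like the sketch below, with stand-in types (cellValue is a hypothetical placeholder for the program's cell_value; rdeepseq is the modern spelling of the Strategy written rnf in the 2009 API). With only 12 cells per rank, each spark does almost no work, so scheduling overhead can easily dominate, which is consistent with the timings above:

```haskell
import Control.Parallel.Strategies (parMap, rdeepseq)

-- One spark per cell of a 12-element rank: far too fine-grained to
-- pay for the cost of creating and scheduling each spark.
-- cellValue is a made-up stand-in for the program's cell_value.
rankValuePar :: Int -> [(Int, Int)] -> Int
rankValuePar rankCoord cells =
    sum (parMap rdeepseq (cellValue rankCoord) cells)
  where
    cellValue rc (file, sq) = rc + file * sq
```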
-- 
Colin Adams
Preston Lancashire


Re: Advice wanted on parallel processing

2009-03-18 Thread Brandon S. Allbery KF8NH

On 2009 Mar 18, at 16:34, Colin Paul Adams wrote:

> The one routine that stood out was this one (about 35% CPU time, with
> 0% attributed to children):
>
> -- | Value of one rank of the board.
> rank_value :: (Int, Array Int Square) -> Int
> rank_value (rank_coord, rank') =
>     sum (map (cell_value rank_coord) (assocs rank'))
>
> The array has 12 elements.



How many times do you call it?  Perhaps the real optimization you need
is to memoize.


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university    KF8NH






Re: Advice wanted on parallel processing

2009-03-18 Thread Colin Paul Adams
> "Brandon" == Brandon S Allbery KF8NH  writes:

>> The array has 12 elements.

Brandon> How many times do you call it?  Perhaps the real
Brandon> optimization you need is to memoize.

Very many times indeed. But it is a different array on most occasions
(I am not updating in place).

It represents one rank of the board on which the game is played. The
move generator produces a new board for each move.
-- 
Colin Adams
Preston Lancashire


Re: Advice wanted on parallel processing

2009-03-18 Thread Brandon S. Allbery KF8NH

On 2009 Mar 18, at 16:59, Colin Paul Adams wrote:
>> How many times do you call it?  Perhaps the real
>> optimization you need is to memoize.
>
> Very many times indeed. But it is a different array on most occasions
> (I am not updating in place).
>
> It represents one rank of the board on which the game is played. The
> move generator produces a new board for each move.



It might be helpful for you to show more of the program.

One thing to keep in mind about parallelization is that the level at
which it happens matters: if the individual calculations across the list
are fast, all you gain by parallelizing them is MP synchronization/locking
overhead. On the other hand, if the caller can be rearranged so that
rank_value is computed on another CPU while the caller is doing something
else, you could gain quite a bit. Or maybe the caller is itself at the
right level to parallelize.
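One way to act on that advice is to move the parallelism up a level, e.g. scoring several whole ranks per spark instead of single cells. A sketch with stand-in types (boardValue, rankScore and cellScore are hypothetical placeholders for the program's functions; parListChunk and `using` are from Control.Parallel.Strategies, and rdeepseq is the modern spelling of the 2009 API's rnf Strategy):

```haskell
import Control.Parallel.Strategies (using, parListChunk, rdeepseq)

-- Score whole ranks in parallel, three ranks per spark, so each spark
-- carries enough work to pay for its scheduling overhead.
-- A rank is modelled here as (coordinate, list of cell codes).
boardValue :: [(Int, [Int])] -> Int
boardValue ranks =
    sum (map rankScore ranks `using` parListChunk 3 rdeepseq)
  where
    rankScore (coord, cells) = sum (map (cellScore coord) cells)
    cellScore coord cell = coord * cell   -- stand-in for cell_value
```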


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university    KF8NH






Re: Advice wanted on parallel processing

2009-03-18 Thread j.waldmann



> (I am not updating in place).
> The move generator produces a new board for each move.

Well, this is sound design, but current memory managers may not be up to it.
If you check the (board) game programming literature, you'll find that
engine authors go to great lengths to bypass automatic memory management
(doing all updates on the "current" board in place, and writing their own
malloc/free for game tree nodes).

This becomes even more important when trying to exploit concurrency.
In theory, that's all very nice (you can evaluate independent branches of
the game tree in parallel), but in practice you're doomed if your memory
allocator/collector is still essentially single-threaded (that is, if it
blocks all threads).

That's language-independent (it's a property of the run-time system).
Of course, in a more declarative language it should be easier for the
compiler to analyze the source code and replace malloc/free with something
better (no allocation through immediate re-use, or cheap deallocation via
some stack regime).
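In Haskell, the in-place style can be sketched with a mutable unboxed array in ST: a make/unmake pair lets a search explore a move and then restore the board without allocating a fresh copy. Everything here (a flat array of Int piece codes, 0 meaning an empty square) is an assumed encoding for illustration, not the actual representation used in the thread's program:

```haskell
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STUArray, newListArray, readArray, writeArray, getElems)

-- Assumed encoding: a flat board of Int piece codes, 0 = empty square.
type Board s = STUArray s Int Int

-- Move a piece, returning whatever was captured so the move can be undone.
makeMove :: Board s -> Int -> Int -> ST s Int
makeMove b from to = do
    p        <- readArray b from
    captured <- readArray b to
    writeArray b from 0
    writeArray b to p
    return captured

-- Restore the board to its state before the corresponding makeMove.
unmakeMove :: Board s -> Int -> Int -> Int -> ST s ()
unmakeMove b from to captured = do
    p <- readArray b to
    writeArray b to captured
    writeArray b from p

-- Round trip: after make then unmake, the board is unchanged.
demo :: [Int]
demo = runST (do
    b   <- newListArray (0, 3) [1, 0, 2, 0]
    cap <- makeMove b 0 2
    unmakeMove b 0 2 cap
    getElems b)
```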

J.W.




Re: Advice wanted on parallel processing

2009-03-18 Thread Colin Paul Adams
> ">" == j waldmann  writes:

>>> (I am not updating in place).  The move generator produces a
>>> new board for each move.

>> Well, this is sound design, but current memory managers may not
>> be up to it.  If you check the (board) game programming
>> literature, you'll find that engine authors take great efforts
>> to bypass automatic memory management (do all updates on the
>> "current" board in-place, and write their own malloc/free for
>> game tree nodes).

I know it.
In fact I wrote a program for this game about 10 years ago, in C++,
which did all updates in place.
It wasn't very good (although, being the only one that implemented the
rules correctly - as far as I could tell; they are very complicated - it
was necessarily the best).
Now I want to have another go in Haskell. Reading about DPH inspired me
to give it a try, although I'm not attempting to use DPH yet. Probably I
was too naive in thinking I could exploit parallelism without pain.

>> This becomes even more important when trying to exploit
>> concurrency.  In theory, that's all very nice (you can evaluate
>> independent branches of the game tree in parallel) but in
>> practice, you're doomed if your memory allocator/collector is
>> still essentially single-threaded (that is, blocks all
>> threads).

OK, that's interesting. GHC's collector is still stop-the-world at the
moment, right? I read in the recent paper that there are plans to remedy
that soon, in which case I can try again and see whether that improves
matters.
-- 
Colin Adams
Preston Lancashire


Re: Advice wanted on parallel processing

2009-03-18 Thread Colin Paul Adams
> "Brandon" == Brandon S Allbery KF8NH  writes:

Brandon> On 2009 Mar 18, at 16:59, Colin Paul Adams wrote:
>>> "Brandon" == Brandon S Allbery KF8NH 
>>> writes:
>> 
 The array has 12 elements.
>> 
Brandon> How many times do you call it?  Perhaps the real
Brandon> optimization you need is to memoize.
>> 
>> Very many times indeed. But it is a different array on most
>> occasions (I am not updating in place).
>> 
>> It represents one rank of the board on which the game is
>> played. The move generator produces a new board for each move.


Brandon> It might be helpful for you to show more of the program.

I'm not sure which bits to show that might be revealing.
The whole thing can be seen by 

darcs pull http://code.haskell.org/srv/code/chu-shogi/

if you are particularly interested. Board.hs is where rank_value is to
be found. You would also need

darcs pull http://code.haskell.org/srv/code/game-tree/

Brandon> One thing to keep in mind about parallelization is that
Brandon> the level at which it happens matters: if individual
Brandon> calculations across the list are fast, all you gain by

Profiling shows that nearly all the time is spent in evaluating the
position.
I'm doing that (at the moment) by evaluating each square in isolation
(that might change in the future), then summing each rank, and then
summing the rank totals.

Brandon> parallelizing it is MP synchronization/locking overhead.
Brandon> On the other hand, it's possible that if the caller can
Brandon> be rearranged so that rank_value is computed on another
Brandon> CPU while it's doing something else, you could gain quite
Brandon> a bit.  Or maybe the caller is itself at the right level
Brandon> to parallelize.

So at first glance it looked as if the calculations for each rank (and
indeed each cell) should be able to proceed completely independently of
one another.

But the profile shows no time allocated to the children of rank_value
(although cell_value, which is called by it, uses a lot of time). So my
intuition seems defeated. Perhaps a code rewrite would help, although I
can't imagine how.
-- 
Colin Adams
Preston Lancashire