Re: Advice wanted on parallel processing

2009-03-19 Thread Colin Paul Adams
  >>>>> "j" == j waldmann <waldm...@imn.htwk-leipzig.de> writes:

 >> (I am not updating in place).  The move generator produces a
 >> new board for each move.

 > Well, this is sound design, but current memory managers may not
 > be up to it.  If you check the (board) game programming
 > literature, you'll find that engine authors take great efforts
 > to bypass automatic memory management (do all updates on the
 > current board in-place, and write their own malloc/free for
 > game tree nodes).

I know it.
In fact I wrote a program for this game about 10 years ago, in C++,
which did all updates in place.
It wasn't very good (although, being the only one that implemented the
rules correctly - as far as I could tell; they are very complicated -
it was necessarily the best).
Now I want to have another go in Haskell. Reading about DPH inspired
me to give it a try, although I'm not attempting to use DPH
yet. Probably I was too naive thinking I was going to be able to
exploit parallelism without pain.

 > This becomes even more important when trying to exploit
 > concurrency.  In theory, that's all very nice (you can evaluate
 > independent branches of the game tree in parallel) but in
 > practice, you're doomed if your memory allocator/collector is
 > still essentially single-threaded (that is, blocks all
 > threads).

OK, that's interesting. GHC's collector is still stop-the-world at the
moment, right? I read in the recent paper that it is intended to try
to remedy that soon, in which case I can try again to see if that
improves matters.
-- 
Colin Adams
Preston Lancashire
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Advice wanted on parallel processing

2009-03-18 Thread Colin Paul Adams
I've just managed to build ghc 6.11 (Thanks Simon).

I did this for two reasons, one of which is I want to try to improve
the speed of the AI for the Chu Shogi program I am writing by making
use of parallel processing. I have a 4-core Xeon running Fedora Linux
10 (AMD64).

I have a repeatable scenario for timings. With ghc 6.10.1, no threaded
runtime, the computer takes 51 or 52 seconds to move (experimental
variation up to 1.2 seconds).

With ghc 6.11, it takes the same time.

If I switch on the threaded runtime, then it rises slightly (perhaps
not significantly - I measure 53-54 seconds). With 2, 3 or 4
processors, I measured (one-off, so not reliable) 65, 55 and 58
seconds respectively.

Well, that doesn't really matter, provided I can get the time back
with interest by exploiting parallelism. My first thought for an
easy win is the move generator. At present I have this code:

-- | Generate all legal moves from state.
--
-- The best moves (according to some heuristic) should come first
generate_moves :: (Non_interactive_state, Maybe Move) ->
                  [(Non_interactive_state, Maybe Move)]
generate_moves (state, _) =
    let colour = case is_black_to_play state of
                   True -> Black
                   False -> White
        pieces' = all_pieces_of_colour (board state) colour
        unsorted = concatMap (generate_moves_for_piece state) pieces'
        sorted = sortBy best_move unsorted
        moves = mapMaybe new_move sorted
    in {-trace (concat (intersperse "\n" (map (print_move . fromJust . snd)
       moves)))-} moves
    where new_move mv =
              case update_from_move (state, Nothing) mv of
                Nothing -> Nothing
                Just (st', Nothing) -> Just (st', Just mv)


and my idea was to run the call to generate_moves_for_piece in
parallel over the pieces' list. I thought I could do that simply by
replacing concatMap with parFlatMap, but discovered I need to supply
an extra argument - a Strategy. Not knowing what this should be
(where's the best place to read up on this?) I tried specifying rwhnf.
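For what it's worth, the shape of what parFlatMap does can be sketched with
just the par/pseq primitives from GHC.Conc in base (parConcatMap below is an
illustrative stand-in, not the real Strategies combinator):

```haskell
import GHC.Conc (par, pseq)

-- A toy spark-per-element concatMap: each sublist is sparked to WHNF
-- while the rest of the spine is built, which is roughly what
-- parFlatMap rwhnf does in Control.Parallel.Strategies.
parConcatMap :: (a -> [b]) -> [a] -> [b]
parConcatMap f = concat . go
  where
    go []       = []
    go (x : xs) = let y  = f x
                      ys = go xs
                  in y `par` (ys `pseq` (y : ys))
```

Sparking only to WHNF is exactly why rwhnf may do too little work per spark:
each sublist is evaluated just to its first constructor.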

My timings then are:

-N1 83 seconds
-N2 57 seconds
-N3 58 seconds
-N4 60 seconds

so a complete failure.

Where should I go from here?
-- 
Colin Adams
Preston Lancashire


Re: Advice wanted on parallel processing

2009-03-18 Thread Daniel Fischer
On Wednesday, 18 March 2009 15:28, Colin Paul Adams wrote:
> I've just managed to build ghc 6.11 (Thanks Simon).
>
> I did this for two reasons, one of which is I want to try to improve
> the speed of the AI for the Chu Shogi program I am writing by making
> use of parallel processing. I have a 4-core Xeon running Fedora Linux
> 10 (AMD64).
>
> I have a repeatable scenario for timings. With ghc 6.10.1, no threaded
> runtime, the computer takes 51 or 52 seconds to move (experimental
> variation up to 1.2 seconds).
>
> With ghc 6.11, it takes the same time.
>
> If I switch on the threaded runtime, then it rises slightly (perhaps
> not significantly - I measure 53-54 seconds). With 2, 3 or 4
> processors, I measured (one-off, so not reliable) 65, 55 and 58
> seconds respectively.
>
> Well, that doesn't really matter, provided I can get the time back
> with interest by exploiting parallelism. My first thought for an
> easy win is the move generator. At present I have this code:
>
> -- | Generate all legal moves from state.
> --
> -- The best moves (according to some heuristic) should come first
> generate_moves :: (Non_interactive_state, Maybe Move) ->
>                   [(Non_interactive_state, Maybe Move)]
> generate_moves (state, _) =
>     let colour = case is_black_to_play state of
>                    True -> Black
>                    False -> White
>         pieces' = all_pieces_of_colour (board state) colour
>         unsorted = concatMap (generate_moves_for_piece state) pieces'
>         sorted = sortBy best_move unsorted
>         moves = mapMaybe new_move sorted
>     in {-trace (concat (intersperse "\n" (map (print_move . fromJust . snd)
>        moves)))-} moves
>     where new_move mv =
>               case update_from_move (state, Nothing) mv of
>                 Nothing -> Nothing
>                 Just (st', Nothing) -> Just (st', Just mv)
>
> and my idea was to run the call to generate_moves_for_piece in
> parallel over the pieces' list. I thought I could do that simply by
> replacing concatMap with parFlatMap, but discovered I need to supply
> an extra argument - a Strategy. Not knowing what this should be
> (where's the best place to read up on this?) I tried specifying rwhnf.

Control.Parallel.Strategies, docs and source?

generate_moves_for_piece produces a list. rwhnf forces this list just enough to see
if it's [] or (_:_) (rwhnf x = x `seq` ()); that doesn't get enough work done in each
thread to compensate for the overhead. Try using rnf after writing an NFData
instance for Move.

If e.g.

data Move = Move {from :: Position, to :: Position}

, the instance would be

instance NFData Move where
rnf (Move f t) = rnf f `seq` rnf t `seq` ()

That might require NFData instances for Position and its components, but
writing those should follow the same pattern.
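With the deepseq package (which ships with GHC), the same instances look like
this; the field types of Position below are hypothetical, since the real type
isn't shown in the thread:

```haskell
import Control.DeepSeq (NFData (..))

-- Hypothetical shapes, purely to illustrate the instances.
data Position = Position Int Int
data Move = Move Position Position

-- Fully evaluate each field; rnf on a fully-forced value returns ().
instance NFData Position where
  rnf (Position r f) = rnf r `seq` rnf f

instance NFData Move where
  rnf (Move f t) = rnf f `seq` rnf t
```

Once the instance exists, rnf (or rdeepseq in the modern Strategies API) makes
each spark evaluate its moves completely rather than to WHNF only.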


> My timings then are:
>
> -N1 83 seconds
> -N2 57 seconds
> -N3 58 seconds
> -N4 60 seconds
>
> so a complete failure.
>
> Where should I go from here?

HTH,
Daniel



Re: Advice wanted on parallel processing

2009-03-18 Thread Colin Paul Adams
 >>>>> "Daniel" == Daniel Fischer <daniel.is.fisc...@web.de> writes:

Daniel generate_moves_for_piece produces a list. rwhnf forces
Daniel this list enough to see if it's [] or (_:_) (rwhnf x = x
Daniel `seq` ()), that doesn't get enough work done in each
Daniel thread to compensate the overhead. Try using rnf after
Daniel writing an NFData instance for Move.

Daniel If e.g.

Daniel data Move = Move {from :: Position, to :: Position}

Daniel , the instance would be

Daniel instance NFData Move where rnf (Move f t) = rnf f `seq`
Daniel rnf t `seq` ()

It made no difference to the timings whatsoever.

Anyway, at that point I decided to revert to the first rule of
performance tuning, and profiled the program for time. The move
generator was using less than 3% of the CPU time, so that was clearly
a waste of time on my part.

The one routine that stood out was this one (about 35% CPU time, with
0% attributed to children):


-- | Value of one rank of the board
rank_value :: (Int, Array Int Square) -> Int
rank_value (rank_coord, rank') = sum (map (cell_value rank_coord)
                                          (assocs rank'))

The array has 12 elements.

So I tried changing the map to parMap rnf. This gave timings of:

-N1 93 seconds
-N2 105 seconds
-N3 206 seconds

at which point I decided I hadn't got a clue what was going on.
-- 
Colin Adams
Preston Lancashire


Re: Advice wanted on parallel processing

2009-03-18 Thread Brandon S. Allbery KF8NH

On 2009 Mar 18, at 16:34, Colin Paul Adams wrote:

The one routine that stood out was this one (about 35% CPU time, with
0% attributed to children):


-- | Value of one rank of the board
rank_value :: (Int, Array Int Square) -> Int
rank_value (rank_coord, rank') = sum (map (cell_value rank_coord)
                                          (assocs rank'))

The array has 12 elements.



How many times do you call it?  Perhaps the real optimization you need  
is to memoize.
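A minimal sketch of the memoization idea, threading a cache explicitly; the
RankKey type and expensiveRankValue are hypothetical stand-ins, since the real
Square type and evaluation aren't shown:

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical: a rank reduced to a list of cell codes.
type RankKey = [Int]

-- Stand-in for the real (expensive) rank evaluation.
expensiveRankValue :: RankKey -> Int
expensiveRankValue = sum

-- Look a rank up in the cache before recomputing; identical ranks
-- recur often across boards, so hits should be frequent.
cachedRankValue :: Map.Map RankKey Int -> RankKey -> (Int, Map.Map RankKey Int)
cachedRankValue cache key =
  case Map.lookup key cache of
    Just v  -> (v, cache)
    Nothing -> let v = expensiveRankValue key
               in (v, Map.insert key v cache)
```

Whether this pays off depends on how often distinct boards share identical
ranks, and on the cost of hashing or comparing a rank versus evaluating it.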


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university    KF8NH






Re: Advice wanted on parallel processing

2009-03-18 Thread Colin Paul Adams
 >>>>> "Brandon" == Brandon S Allbery KF8NH <allb...@ece.cmu.edu> writes:

 >> The array has 12 elements.

Brandon How many times do you call it?  Perhaps the real
Brandon optimization you need is to memoize.

Very many times indeed. But it is a different array on most occasions
(I am not updating in place).

It represents one rank of the board on which the game is played. The
move generator produces a new board for each move.
-- 
Colin Adams
Preston Lancashire


Re: Advice wanted on parallel processing

2009-03-18 Thread Brandon S. Allbery KF8NH

On 2009 Mar 18, at 16:59, Colin Paul Adams wrote:
> >>>>> "Brandon" == Brandon S Allbery KF8NH <allb...@ece.cmu.edu> writes:
>
>  >> The array has 12 elements.
>
>     Brandon How many times do you call it?  Perhaps the real
>     Brandon optimization you need is to memoize.
>
> Very many times indeed. But it is a different array on most occasions
> (I am not updating in place).
>
> It represents one rank of the board on which the game is played. The
> move generator produces a new board for each move.



It might be helpful for you to show more of the program.

One thing to keep in mind about parallelization is that the level at
which it happens matters: if individual calculations across the list are
fast, all you gain by parallelizing is MP synchronization/locking
overhead. On the other hand, it's possible that if the caller can be
rearranged so that rank_value is computed on another CPU while it's
doing something else, you could gain quite a bit. Or maybe the caller
is itself at the right level to parallelize.
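The coarser-grained version of that idea can be sketched with the par/pseq
primitives from GHC.Conc in base (score here is a hypothetical subtree
evaluator; the real search code isn't shown):

```haskell
import GHC.Conc (par, pseq)

-- Score two independent subtrees of the game tree in parallel:
-- spark the left score while evaluating the right one on this thread.
-- One spark per subtree is far coarser than one per 12-cell rank.
scoreBoth :: (a -> Int) -> a -> a -> Int
scoreBoth score l r =
  let sl = score l
      sr = score r
  in sl `par` (sr `pseq` max sl sr)
```

Each spark now carries a whole subtree evaluation, so the work per spark can
dwarf the scheduling overhead instead of being dwarfed by it.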


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university    KF8NH






Re: Advice wanted on parallel processing

2009-03-18 Thread j.waldmann



 (I am not updating in place).
 The move generator produces a new board for each move.

Well, this is sound design, but current memory managers may not be up to it. 
If you check the (board) game programming literature, 
you'll find that engine authors take great efforts to bypass automatic
memory management  
(do all updates on the current board in-place, and write their own
malloc/free for game tree nodes).

This becomes even more important when trying to exploit concurrency.
In theory, that's all very nice (you can evaluate independent branches of
the game tree
in parallel) but in practice, you're doomed if your memory
allocator/collector 
is  still essentially single-threaded (that is, blocks all threads).

That's language independent (it's a property of the run-time system).
Of course in a more declarative language it should be easier for the
compiler 
to analyze the source code and replace malloc/free by something better
(no allocation by immediate re-use, or easy deallocation by some stack
regime).
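For illustration, an in-place rank update can be written in Haskell itself
with a mutable unboxed array in ST; the 12-cell rank and integer cell codes
below are assumptions, a sketch of the technique rather than anyone's actual
engine code:

```haskell
import Data.Array.ST (runSTUArray, newListArray, readArray, writeArray)
import Data.Array.Unboxed (UArray, elems)

-- Apply a move to one 12-cell rank by mutating a single buffer:
-- copy the piece code from 'from' to 'to', then empty 'from'.
applyMove :: Int -> Int -> [Int] -> UArray Int Int
applyMove from to cells = runSTUArray $ do
  rank <- newListArray (0, 11) cells   -- one mutable buffer, no new boards
  p <- readArray rank from
  writeArray rank to p
  writeArray rank from 0               -- 0 = empty square
  return rank
```

The catch for tree search is that in-place updates must be undone on
backtracking (the classic make/unmake-move pattern), which is exactly the
bookkeeping that the allocate-a-fresh-board design avoids.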

J.W.

-- 
View this message in context: 
http://www.nabble.com/Advice-wanted-on-parallel-processing-tp22580444p22591720.html
Sent from the Haskell - Glasgow-haskell-users mailing list archive at 
Nabble.com.
