Re: [computer-go] A cluster version of Zen is running on cgos 19x19
> In your (or Sylvain's?) recent paper, you wrote less than one second interval was useless. I've observed similar. I'm now evaluating the performance with 0.2, 0.4, 1 and 4 second intervals for a 5-second-per-move setting on a 19x19 board on 32 nodes of the HA8000 cluster.

Yes, one second is fine for 5 seconds per move. Maybe you can check whether you get a linear speed-up if you artificially simulate zero communication time? My guess is that the communication time should not be a problem, but if you don't use MPI, maybe there's something in your implementation of communications?

By the way, a cluster parallelization in MPI can be developed very quickly, and MPI is efficient: mpi_all_reduce has a computational cost logarithmic in the number of nodes.

Good luck,
Olivier
___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] A cluster version of Zen is running on cgos 19x19
Olivier Teytaud: aa5e3c330911250005v1d434a5bj8a09067a620ef...@mail.gmail.com:
> > In your (or Sylvain's?) recent paper, you wrote less than one second interval was useless. I've observed similar. I'm now evaluating the performance with 0.2, 0.4, 1 and 4 second intervals for a 5-second-per-move setting on a 19x19 board on 32 nodes of the HA8000 cluster.
>
> Yes, one second is fine for 5 seconds per move. Maybe you can check whether you get a linear speed-up if you artificially simulate zero communication time? My guess is that the communication time should not be a problem, but if you don't use MPI, maybe there's something in your implementation of communications?

Hmm, I don't think my communication code is the problem.

> By the way, a cluster parallelization in MPI can be developed very quickly, and MPI is efficient: mpi_all_reduce has a computational cost logarithmic in the number of nodes.

Even if the sum-up is done in logarithmic time (binary-tree style), isn't the time to collect all information from all nodes proportional to the number of nodes if the master node has few communication ports?

MPI is the best choice for dedicated HPC clusters, I agree. It imposes, however, several constraints: for example, nodes cannot be unplugged or plugged in during operation. Also, MPI cannot be installed on some computers with uncommon operating systems, or on small computers without enough memory, such as game consoles. I just want a freer parallel and distributed computing environment for MCTS than MPI. My code is now running on a mini PC cluster at my home, and I don't want to install MPI on my computers :).

By the way, have you experimented with a scheme that just adds instead of averaging? When I tested it, my code had some bugs and no success.

Thanks,
Hideki
--
g...@nue.ci.i.u-tokyo.ac.jp (Kato)
Re: [computer-go] A cluster version of Zen is running on cgos 19x19
> Even if the sum-up is done in logarithmic time (binary-tree style), isn't the time to collect all information from all nodes proportional to the number of nodes if the master node has few communication ports?

No (unless I misunderstood what you mean, sorry in that case)! Use a tree of nodes to aggregate the information, and everything is logarithmic. This is done implicitly in MPI. If you have 8 nodes A, B, C, D, E, F, G, H, then:

(i) first layer:
A and B send their information to B;
C and D send their information to D;
E and F send their information to F;
G and H send their information to H.

(ii) second layer:
B and D send their information to D;
F and H send their information to H.

(iii) third layer:
D and H send their information to H.

Then do the same in the reverse order, so that the accumulated information is sent back to all nodes.

> By the way, have you experimented with a scheme that just adds instead of averaging? When I tested it, my code had some bugs and no success.

Yes, we have tested it. Surprisingly, no significant difference. But I don't know whether this would still hold today, as we now have some pattern-based exploration. For a program whose score depends almost only on percentages, it is not surprising that averaging and summing are equivalent.

Best regards,
Olivier
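[The eight-node layering above can be sketched as a small simulation. This is hypothetical Python of my own for illustration, not code from MoGo, Zen, or MPI itself: each round halves the number of active senders, so the reduction finishes in log2(N) rounds, and replaying the same layers in reverse broadcasts the total back to every node.]

```python
import math

def tree_allreduce(values):
    """Simulate the pairwise tree reduction described above.

    Assumes len(values) is a power of two. At reduction step s
    (stride = 2**s), the node at index i passes its partial sum to
    the node at i + stride, exactly like the (i)/(ii)/(iii) layers.
    The same layers run in reverse to broadcast the total back.
    Returns (final per-node values, number of communication rounds).
    """
    n = len(values)
    vals = list(values)
    steps = int(math.log2(n))
    rounds = 0
    # Reduction: partial sums flow toward the last node (node H above).
    for s in range(steps):
        stride = 1 << s
        for i in range(stride - 1, n, 2 * stride):
            vals[i + stride] += vals[i]  # e.g. layer (i): A's data combined at B
        rounds += 1
    # Broadcast: run the layers in reverse so every node gets the total.
    for s in reversed(range(steps)):
        stride = 1 << s
        for i in range(stride - 1, n, 2 * stride):
            vals[i] = vals[i + stride]
        rounds += 1
    return vals, rounds
```

With 8 nodes this is 3 reduction rounds plus 3 broadcast rounds, i.e. 2 * log2(8) = 6 communication rounds in total, independent of how much data each node contributes.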
Re: [computer-go] A cluster version of Zen is running on cgos 19x19
Olivier Teytaud: aa5e3c330911250119x5e01fa32w2e5f3db68704d...@mail.gmail.com:
> No (unless I misunderstood what you mean, sorry in that case)! Use a tree of nodes to aggregate the information, and everything is logarithmic. This is done implicitly in MPI. [...] Then do the same in the reverse order, so that the accumulated information is sent back to all nodes.

Interesting; surely the order is logarithmic. But how long does it take a packet to pass through a layer? I'm afraid the actual delay may increase.

> Yes, we have tested it. Surprisingly, no significant difference. But I don't know whether this would still hold today, as we now have some pattern-based exploration. For a program whose score depends almost only on percentages, it is not surprising that averaging and summing are equivalent.

Simple adding has the advantage that no synchronization to sum up the statistics of all computers is required, so the time from sending a statistics packet to adding it at the root node is reduced. This advantage, however, may not be effective in MPI environments, because the number of packets increases from N to N^2 if real (i.e., UDP) broadcasting is not used. So it's not so surprising that there was no significant difference in MPI environments. Ah, if a tree structure is used to broadcast packets, things may vary.
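[To make the packet-count argument concrete, here is a back-of-the-envelope sketch. This is hypothetical Python of my own, not from the thread: per synchronization, root collection and a tree all-reduce both send O(N) point-to-point packets, while an all-to-all exchange without real broadcasting sends O(N^2); the tree's extra benefit is that its packets are spread over only O(log N) sequential rounds.]

```python
import math

def packets_per_sync(n):
    """Point-to-point packet counts for one synchronization of n nodes.

    Returns (root_collect, all_to_all, tree_allreduce, tree_rounds):
    - root_collect: every worker sends once to a single root, O(N)
    - all_to_all: everyone sends to everyone else, O(N^2)
    - tree_allreduce: reduce plus broadcast along a binary tree, O(N)
    - tree_rounds: sequential rounds the tree scheme needs, O(log N)
    """
    root_collect = n - 1
    all_to_all = n * (n - 1)
    tree_allreduce = 2 * (n - 1)
    tree_rounds = 2 * math.ceil(math.log2(n))
    return root_collect, all_to_all, tree_allreduce, tree_rounds
```

For the 32-node setting discussed in this thread: 31 packets for root collection, 992 for all-to-all, and 62 for the tree all-reduce, the latter spread over 10 rounds.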
Thanks a lot,
Hideki
--
g...@nue.ci.i.u-tokyo.ac.jp (Kato)
Re: [computer-go] A cluster version of Zen is running on cgos 19x19
> Interesting; surely the order is logarithmic. But how long does it take a packet to pass through a layer? I'm afraid the actual delay may increase.

With gigabit ethernet, my humble opinion is that you should have no problem. But testing what happens if you artificially cancel the time of the messages might confirm or refute this.

If you have trouble due to the communication time, I'm sure you can optimize it. MPI provides plenty of well-designed primitives for encoding communications. Unless you need very precise optimization, it's not worth working directly with sockets.

Olivier
Re: [computer-go] A cluster version of Zen is running on cgos 19x19
The performance gap is perhaps due to the algorithms. Almost all cluster versions of current strong programs (MoGo, MFG, Fuego and Zen) use root parallelism, while shared-memory computers allow us to use thread parallelism, which gives better performance.

I think you should not have trouble with your network, at least with the number of machines you are considering. Perhaps you should increase a little the time between two communications? With something like mpi_all_reduce for averaging the statistics over the whole tree at each communication, more than 3 or 4 communications per second is useless. Averaging statistics in nodes with less than 5% of the total number of simulations might be useless as well.

Best regards,
Olivier
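[As an illustration of the root-parallel averaging step, here is a hypothetical Python sketch. The function and data layout are my own, not MoGo's or Zen's actual code: each worker reports per-move (visits, wins) statistics, the root averages them across workers, and entries holding less than 5% of the combined simulations are skipped, mirroring the heuristic above.]

```python
def average_root_stats(worker_stats, min_fraction=0.05):
    """Average per-move (visits, wins) statistics from each worker.

    worker_stats: list of dicts, one per worker, mapping move -> (visits, wins).
    Moves whose combined visits fall below min_fraction of the total
    number of simulations are dropped before averaging.
    Returns move -> (avg_visits, avg_wins) averaged over the workers.
    """
    n_workers = len(worker_stats)
    combined = {}
    # Sum each worker's contribution per move.
    for stats in worker_stats:
        for move, (visits, wins) in stats.items():
            v, w = combined.get(move, (0, 0))
            combined[move] = (v + visits, w + wins)
    total = sum(v for v, _ in combined.values())
    # Keep only moves with a meaningful share of the simulations.
    return {
        move: (v / n_workers, w / n_workers)
        for move, (v, w) in combined.items()
        if v >= min_fraction * total
    }
```

In a real root-parallel scheme this merge would run at every communication interval (e.g. once per second), with the averaged statistics pushed back into each worker's root node.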
Re: [computer-go] A cluster version of Zen is running on cgos 19x19
Thank you Olivier,

Olivier Teytaud: aa5e3c330911242304tc6b9e1bk466b1f08cb65d...@mail.gmail.com:
> The performance gap is perhaps due to the algorithms. Almost all cluster versions of current strong programs (MoGo, MFG, Fuego and Zen) use root parallelism, while shared-memory computers allow us to use thread parallelism, which gives better performance.
>
> I think you should not have trouble with your network, at least with the number of machines you are considering. Perhaps you should increase a little the time between two communications? With something like mpi_all_reduce for averaging the statistics over the whole tree at each communication, more than 3 or 4 communications per second is useless. Averaging statistics in nodes with less than 5% of the total number of simulations might be useless as well.

In your (or Sylvain's?) recent paper, you wrote less than one second interval was useless. I've observed similar. I'm now evaluating the performance with 0.2, 0.4, 1 and 4 second intervals for a 5-second-per-move setting on a 19x19 board on 32 nodes of the HA8000 cluster. Though I don't have enough games yet, the current best is the 1-second interval, which improves about 400 Elo in self-play.

Then, why do we get similar results from different implementations of root parallelism, based on different programs and on different clusters? I don't use MPI for the cluster version of Zen. Zen's playouts are slower than MoGo's. Etc... One second is a mysterious time :(.

Hideki
--
g...@nue.ci.i.u-tokyo.ac.jp (Kato)