I’m currently running 200 games against GnuGo to see if a change to my program made a difference. But I now wonder whether that’s enough games, because I ran the same benchmark with the same code (but a different compiler version) and got different results:

85.5% wins (171 games of 200) the first time (+/- 2.5 according to gogui-twogtp)
79.0% wins (158 games of 200) the second time (+/- 2.9 according to gogui-twogtp)

Looking at these results would make me believe that the difference is significant (the intervals don’t overlap) but then the real difference is only 13 wins …

My statistics knowledge is sketchy at best but assuming that what gogui-twogtp calculates is the 95% confidence interval (I’m pretty sure I’m mixing terms here) it could well be that the difference between the two runs above is just random.

So, this leads me to two questions:

1. How many games do you normally run to test if a change is significant “enough”?
2. Any good resources on how to calculate these statistics (i.e. if I wanted to find the error margin for a 99% confidence interval)?

Urban

--
Blog: http://bettong.net/
Twitter: https://twitter.com/ujh
Homepage: http://www.urbanhafner.com/
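
For reference, here is a minimal sketch of the arithmetic behind the numbers above. It assumes each game is an independent win/loss trial and uses the normal approximation to the binomial; the function names are hypothetical and not part of gogui-twogtp. Under those assumptions, the +/- 2.5 and +/- 2.9 figures match one standard error of the win rate rather than a 95% interval, which suggests (but does not confirm) that this is what the tool reports.

```python
import math
from statistics import NormalDist


def win_rate_interval(wins, games, confidence=0.95):
    """Return (win rate, standard error, margin) using the normal approximation.

    The margin is the half-width of the confidence interval at the given level.
    """
    p = wins / games
    se = math.sqrt(p * (1 - p) / games)
    # Two-sided critical value, e.g. about 1.96 for 95% and 2.58 for 99%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return p, se, z * se


def two_proportion_z_test(wins_a, games_a, wins_b, games_b):
    """Two-sided z-test for the difference between two win rates.

    Uses the pooled win rate under the null hypothesis that both runs
    share the same true winning probability.
    """
    p_a = wins_a / games_a
    p_b = wins_b / games_b
    pooled = (wins_a + wins_b) / (games_a + games_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / games_a + 1 / games_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    for wins in (171, 158):
        for level in (0.95, 0.99):
            p, se, margin = win_rate_interval(wins, 200, level)
            print(f"{wins}/200: rate={p:.3f} se={se:.4f} "
                  f"{level:.0%} margin=+/-{margin:.4f}")
    z, p_value = two_proportion_z_test(171, 200, 158, 200)
    print(f"z={z:.2f} two-sided p={p_value:.3f}")
```

With the two runs above, the z statistic comes out at roughly 1.7 and the two-sided p-value at roughly 0.09, so under this approximation the 13-win difference would not count as significant at the usual 5% level. Comparing whether two separate intervals overlap is not the same test as looking at the difference directly, which is why the test on the difference is shown here.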

_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go