I ended up writing a new metric-test function for this problem, which I
will likely release as a library soon. For now, the function is just
embedded in the project where I first needed it.
Usage looks like the following. Note that the :baseline argument is
optional; if it is not provided, the test fails and recommends a starting
value for the baseline.
(ns figurer.car-example-test
  (:require [clojure.test :refer :all]
            [figurer.car-example :refer :all]
            [figurer.core :as figurer]
            [metric-test.core :refer [metric-test]]))

(def metrics
  {:expected-value figurer/expected-value
   :plan-value (fn [solution]
                 (let [plan (figurer/sample-plan solution)
                       plan-value (/ (apply + (map (:value solution) (:states plan)))
                                     (count (:states plan)))]
                   plan-value))})

(deftest gentle-turn-metric-test
  (metric-test "gentle turn metric 0.1"
               #(figurer/figure gentle-turn-problem {:max-seconds 0.1})
               :metrics metrics
               :baseline {:expected-value {:mean 62.508, :stdev 0.542}
                          :plan-value {:mean 70.569, :stdev 1.46}}))
And output for a failing test looks like this:
FAIL in (gentle-turn-metric-test) (core.clj:112)
gentle turn metric 0.1
Some metrics changed significantly compared to the baseline.
| Metric          | Old            | New            | Change                | Unusual |
|-----------------+----------------+----------------+-----------------------+---------|
| :expected-value | 62.508 ± 0.542 | 72.566 ± 0.499 | 10.058 (18.558 stdev) | *       |
| :plan-value     | 70.569 ± 1.46  | 70.541 ± 1.378 | -0.028 (-0.019 stdev) |         |
New baseline if these changes are accepted:
{:expected-value {:mean 72.566, :stdev 0.499},
:plan-value {:mean 70.541, :stdev 1.378}}
expected: false
actual: false
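
When no baseline has been recorded yet, the same test can be written without
the :baseline argument; per the note above, it then fails and recommends a
starting baseline to copy in. A minimal sketch (the test name is illustrative):

(deftest gentle-turn-initial-metric-test
  (metric-test "gentle turn metric 0.1"
               #(figurer/figure gentle-turn-problem {:max-seconds 0.1})
               :metrics metrics))
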
Here is the commit in which I created the metric-test function and used it
for just one of my tests:
https://github.com/ericlavigne/figurer/commit/1153b5d4db898d042de6e3aa0ab9d77e65c6e3cc
On Saturday, October 6, 2018 at 5:41:27 PM UTC-4, Eric Lavigne wrote:
>
> *Summary*
>
> I am writing tests involving multiple metrics with tradeoffs. When I make
> a software change, the tests should check for changes across any of these
> metrics and show me, for example, that I improved along one metric but at
> the expense of another. If I decide that the changes are overall
> acceptable, I should be able to quickly update the test with the new
> baseline. So far I am following this strategy using just clojure.test and
> a bunch of custom code.
>
> Is there a testing library that would help with this? If not, does anyone
> else have use for such a tool?
>
> *Current code*
>
>
> https://github.com/ericlavigne/figurer/blob/master/src/figurer/test_util.clj
>
> *Details of my problem*
>
> I am writing performance tests for a Monte Carlo tree search library
> called figurer. In the beginning, the tests were focused on avoiding
> regression. I recorded that within 0.1 seconds figurer could find a solution
> with value between 71.7 and 73.6 (based on trying this 10 times) and wrote
> a test that would fail if the value fell outside this range. This was
> helpful for determining whether the algorithm was getting better or worse,
> but did not help with why it was getting better or worse.
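>
> A test in that original style might look roughly like the following (using
> only the figurer functions shown in the newer code above; the test name is
> illustrative):
>
>   (deftest gentle-turn-regression-test
>     (let [solution (figurer/figure gentle-turn-problem {:max-seconds 0.1})
>           value (figurer/expected-value solution)]
>       (is (< 71.7 value 73.6))))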
>
> I made a change to the library to focus more on refinement of paths that
> showed early promise, rather than spreading attention evenly across all
> candidates. I expected that this change would substantially improve the
> value found, but instead it slightly reduced that value. There were many
> possible explanations. Maybe the more sophisticated algorithm did a better
> job of choosing new paths to try, but at too much cost in time spent per
> path evaluation. Maybe the new algorithm focused too much on refining a
> path that showed early promise, ignoring a better path that got an
> unlucky roll of the dice early on. I needed to compare a variety of
> performance metrics between the old and new versions of the code.
>
> Commit for the unexpected results described above:
>
> Switch from random to UCT-based exploration (worse performance)
>
> https://github.com/ericlavigne/figurer/commit/97c76b88ac3de0874444b0cfa55005ab909aba21
>
> I would like to track all of the following metrics to help me understand
> the effect of each code change (a rough sketch of how a couple of these
> might plug into the metrics map follows the list).
>
> 1) Estimated value of the chosen plan
> 2) Closeness of the plan's first move to the best plan
> 3) Number of plans that were considered (raw speed)
> 4) Closeness of closest candidate first move to the best plan
> 5) Number of first moves that were considered
> 6) Evaluation depth of the chosen plan
> 7) Maximum evaluation depth across all considered plans
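>
> Several of these could probably be expressed as extra entries in the same
> metrics map shown above. A rough sketch (the :plan-depth function reuses
> figurer/sample-plan from the real code; :num-iterations is a hypothetical
> field name, not an actual figurer key):
>
>   (def richer-metrics
>     (assoc metrics
>            ;; Metric 6: evaluation depth of the chosen plan (sketch).
>            :plan-depth (fn [solution]
>                          (count (:states (figurer/sample-plan solution))))
>            ;; Metric 3: plans considered; :num-iterations is hypothetical.
>            :plans-considered (fn [solution] (:num-iterations solution))))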
>
> For each metric, I need to record a baseline distribution by running the
> code multiple times. The test will need to check whether new measurements
> are consistent with that recorded distribution. If any metric is measured
> outside the expected range, then a report should show me all metrics and
> how they changed. The same report should also include a new baseline data
> structure that I can copy back into my tests if I decide to accept this
> result as my new baseline.
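>
> The statistical core of that check is small. A rough sketch of the kind of
> helpers I have in mind (names and the 3-stdev threshold are illustrative,
> not the actual metric-test code):
>
>   (defn summarize [samples]
>     ;; Baseline distribution of one metric over repeated runs.
>     (let [n (count samples)
>           mean (/ (apply + samples) n)
>           variance (/ (apply + (map #(let [d (- % mean)] (* d d)) samples))
>                       (max 1 (dec n)))]
>       {:mean (double mean) :stdev (Math/sqrt (double variance))}))
>
>   (defn unusual? [{:keys [mean stdev]} new-value]
>     ;; Flag a measurement that falls far from the baseline, in stdev units.
>     (> (Math/abs (double (/ (- new-value mean) (max stdev 1e-9)))) 3.0))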
>
> The closest I've found so far is clojure-expectations, which has support
> for comparing multiple values (via a map) as well as ranges (via
> approximately). I would likely build on top of those capabilities and add
> support for the baselining process.
>
> https://clojure-expectations.github.io/
>
> *Is there another library that better matches this need? Anyone have a
> better approach for the problem?*
>