I ended up writing a new metric-test function for this problem, which I
will likely release as a library soon. For now, the function is just
embedded in the project where I first needed it.
Usage looks like the following. Note that the :baseline argument is
optional; if it is not provided, the test fails and recommends a starting
value for the baseline.
(ns figurer.car-example-test
  (:require [clojure.test :refer :all]
            [figurer.car-example :refer :all]
            [figurer.core :as figurer]
            [metric-test.core :refer [metric-test]]))

(def metrics
  {:expected-value figurer/expected-value
   :plan-value (fn [solution]
                 (let [plan (figurer/sample-plan solution)
                       plan-value (/ (apply + (map (:value solution) (:states plan)))
                                     (count (:states plan)))]
                   plan-value))})

(deftest gentle-turn-metric-test
  (metric-test "gentle turn metric 0.1"
               #(figurer/figure gentle-turn-problem {:max-seconds 0.1})
               :metrics metrics
               :baseline {:expected-value {:mean 62.508, :stdev 0.542}
                          :plan-value {:mean 70.569, :stdev 1.46}}))
And output for a failing test looks like this:
FAIL in (gentle-turn-metric-test) (core.clj:112)
gentle turn metric 0.1
Some metrics changed significantly compared to the baseline.
| Metric          | Old            | New            | Change                | Unusual |
|-----------------+----------------+----------------+-----------------------+---------|
| :expected-value | 62.508 ± 0.542 | 72.566 ± 0.499 | 10.058 (18.558 stdev) | *       |
| :plan-value     | 70.569 ± 1.46  | 70.541 ± 1.378 | -0.028 (-0.019 stdev) |         |
New baseline if these changes are accepted:
{:expected-value {:mean 72.566, :stdev 0.499},
:plan-value {:mean 70.541, :stdev 1.378}}
expected: false
actual: false
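
When no baseline has been recorded yet, the same test can be written without
the :baseline argument; per the note above, it then fails and recommends a
starting baseline to copy in. A minimal sketch (the test name is illustrative):

(deftest gentle-turn-initial-metric-test
  (metric-test "gentle turn metric 0.1"
               #(figurer/figure gentle-turn-problem {:max-seconds 0.1})
               :metrics metrics))
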
Here is the commit in which I created the metric-test function and used it
for just one of my tests:
https://github.com/ericlavigne/figurer/commit/1153b5d4db898d042de6e3aa0ab9d77e65c6e3cc
On Saturday, October 6, 2018 at 5:41:27 PM UTC-4, Eric Lavigne wrote:
>
> *Summary*
>
> I am writing tests involving multiple metrics with tradeoffs. When I make
> a software change, the tests should check for changes across any of these
> metrics and show me, for example, that I improved along one metric but at
> the expense of another. If I decide that the changes are overall
> acceptable, I should be able to quickly update the test with the new
> baseline. So far I am following this strategy using just clojure.test and
> a bunch of custom code.
>
> Is there a testing library that would help with this? If not, does anyone
> else have use for such a tool?
>
> *Current code*
>
>
> https://github.com/ericlavigne/figurer/blob/master/src/figurer/test_util.clj
>
> *Details of my problem*
>
> I am writing performance tests for a Monte Carlo tree search library
> called figurer. In the beginning, the tests were focused on avoiding
> regression. I recorded that within 0.1 seconds figurer could find a solution
> with value between 71.7 and 73.6 (based on trying this 10 times) and wrote
> a test that would fail if the value fell outside this range. This was
> helpful for determining whether the algorithm was getting better or worse,
> but did not help with why it was getting better or worse.
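>
> A test in that original style might look roughly like the following (using
> only the figurer functions shown in the newer code above; the test name is
> illustrative):
>
>   (deftest gentle-turn-regression-test
>     (let [solution (figurer/figure gentle-turn-problem {:max-seconds 0.1})
>           value (figurer/expected-value solution)]
>       (is (< 71.7 value 73.6))))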
>
> I made a change to the library to focus more on refinement of paths that
> showed early promise, rather than spreading attention evenly across all
> candidates. I expected that this change would substantially improve the
> value found, but instead it slightly reduced that value. There were many
> possible explanations. Maybe the more sophisticated algorithm did a better
> job of choosing new paths to try, but at too much cost in time spent per
> path evaluation. Maybe the new algorithm focused too much on refining a
> path that showed early promise, ignoring a better path that got an
> unlucky roll of the dice early on. I needed to compare a variety of
> performance metrics between the old and new versions of the code.
>
> Commit for the unexpected results described above:
>
> Switch from random to UCT-based exploration (worse performance)
>
> https://github.com/ericlavigne/figurer/commit/97c76b88ac3de0874444b0cfa55005ab909aba21
>
> I would like to track all of the following metrics to help me understand
> the effect of each code change (a rough sketch of how a couple of these
> might plug into the metrics map follows the list).
>
> 1) Estimated value of the chosen plan
> 2) Closeness of the plan's first move to the best plan
> 3) Number of plans that were considered (raw speed)
> 4) Closeness of closest candidate first move to the best plan
> 5) Number of first moves that were considered
> 6) Evaluation depth of the chosen plan
> 7) Maximum evaluation depth across all considered plans
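>
> Several of these could probably be expressed as extra entries in the same
> metrics map shown above. A rough sketch (the :plan-depth function reuses
> figurer/sample-plan from the real code; :num-iterations is a hypothetical
> field name, not an actual figurer key):
>
>   (def richer-metrics
>     (assoc metrics
>            ;; Metric 6: evaluation depth of the chosen plan (sketch).
>            :plan-depth (fn [solution]
>                          (count (:states (figurer/sample-plan solution))))
>            ;; Metric 3: plans considered; :num-iterations is hypothetical.
>            :plans-considered (fn [solution] (:num-iterations solution))))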
>
> For each metric, I need to record a baseline distribution by running the
> code multiple times. The test will need to check whether new measurements
> are consistent with that recorded distribution. If any metric is measured
> outside the expected range, then a report should show me all metrics and
> how they changed. The same report should also include a new baseline data
> structure that I can copy back into my tests if I decide to accept this
> result as my new baseline.
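>
> The statistical core of that check is small. A rough sketch of the kind of
> helpers I have in mind (names and the 3-stdev threshold are illustrative,
> not the actual metric-test code):
>
>   (defn summarize [samples]
>     ;; Baseline distribution of one metric over repeated runs.
>     (let [n (count samples)
>           mean (/ (apply + samples) n)
>           variance (/ (apply + (map #(let [d (- % mean)] (* d d)) samples))
>                       (max 1 (dec n)))]
>       {:mean (double mean) :stdev (Math/sqrt (double variance))}))
>
>   (defn unusual? [{:keys [mean stdev]} new-value]
>     ;; Flag a measurement that falls far from the baseline, in stdev units.
>     (> (Math/abs (double (/ (- new-value mean) (max stdev 1e-9)))) 3.0))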
>
> The closest I've found so far is clojure-expectations, which has support
> for comparing multiple values (via a map) as well as ranges (via
> approximately). I would likely build on top of those capabilities and add
> support for the baselining process.
>
> https://clojure-expectations.github.io/
>
> *Is there another library that better matches this need? Anyone have a
> better approach for the problem?*
>