EBernhardson has submitted this change and it was merged.
Change subject: Add first pass at magnitude of change metrics to comparison report
......................................................................
Add first pass at magnitude of change metrics to comparison report
Added calculation of mean, s.d., and median for the number of changes to
TopN results and to TotalHits. Added support for histograms of changes.
Updated documentation on the report, metrics, stats, charts, and
Python prerequisites.
Change-Id: I643f35755437041e1a56b159638bf72bef451376
---
M README.md
M relcomp.py
2 files changed, 164 insertions(+), 26 deletions(-)
Approvals:
DCausse: Verified; Looks good to me, but someone else must approve
EBernhardson: Verified; Looks good to me, approved
diff --git a/README.md b/README.md
index 8948823..1104201 100644
--- a/README.md
+++ b/README.md
@@ -19,7 +19,7 @@
## Prerequisites
* Python: There's nothing too fancy here, and it works with Python 2.7, though
a few packages are required:
- * The package `jsonpath-rw` is required by the main Rel Lab.
+ * The packages `jsonpath-rw`, `numpy`, and `matplotlib` are required by the main Rel Lab.
* The package `termcolor` is required by the Cirrus Query Debugger.
* If you don't have one of these packages, you can get it with `pip install
<package-name>` (`sudo` may be required to install packages).
* SSH access to the host you intend to connect to.
@@ -82,6 +82,30 @@
A directory for each comparison between `[test1]` and `[test2]` is created under
the `relevance/comparisons/` directory. The name is a concatenation of the
"safe" versions of the `name`s given to the query sets. The original `.ini`
file is copied to `config.ini`, the final report is in `report.html`, and the
diffs are stored in the `diffs/` directory, named in order as `diff#.html`.
+### Report Metrics
+
+At the moment, report metrics are specified in code, in `relcomp.py`, in function `main()`, in the array `myMetrics`. Metrics are presented in the report in the order they are listed in `myMetrics`.
+
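As an illustrative sketch (using the metric classes from this patch; the particular mix is up to you), a `myMetrics` configuration might look like:

```python
# Sketch of a myMetrics configuration inside relcomp.py's main().
# printnum normally comes from the command line (default 20).
printnum = 20

myMetrics = [
    QueryCount(),                        # query counts, plus TotalHits stats and charts
    ZeroResultsRate(printnum=printnum),  # zero results rate; shows Diffs
    TopNDiff(3, sorted=True, printnum=printnum),
    TopNDiff(5, sorted=False, printnum=printnum, showstats=True),
]
```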
+**`QueryCount`** gives a count of queries in each corpus. It was also a convenient place to add statistics and charts (see below) for the number of TotalHits (which can be toggled with the `resultscount` parameter). `QueryCount` does not show any Diffs (see below).
+
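For example, the TotalHits statistics and charts can be disabled when the metric is constructed (`resultscount` defaults to true in this patch):

```python
# Query count with TotalHits stats/charts (the default) vs. without.
count_with_stats = QueryCount()
count_only = QueryCount(resultscount=False)
```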
+**`ZeroResultsRate`** calculates the zero results rate for each corpus/config combo and computes the difference in these rates between baseline and delta. `ZeroResultsRate` does show Diffs (see below).
+
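As a quick sketch of the computation (with made-up TotalHits values; the real metric reads `totalHits` from each query result):

```python
from __future__ import division

# Hypothetical totalHits per query for each corpus.
baseline_hits = [0, 5, 0, 12]
delta_hits = [0, 5, 3, 12]

zrr_baseline = sum(1 for h in baseline_hits if h == 0) / len(baseline_hits)  # 0.50
zrr_delta = sum(1 for h in delta_hits if h == 0) / len(delta_hits)           # 0.25
zrr_change = zrr_delta - zrr_baseline                                        # -0.25
```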
+**`TopNDiff`** reports the number of queries with differences in the top *n* results returned. *n* can be set to any integer (though it shouldn't be larger than the number of results requested by the `searchCommand`, or the results won't be very meaningful). Differences can be considered `sorted` or not; e.g., if `sorted=True`, then swapping the top two results counts as a difference; if `sorted=False`, it does not. `TopNDiff` does show Diffs (see below).
+
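A sketch of the distinction, mirroring the checks in `TopNDiff.has_condition` (the page-ID lists are made up):

```python
# Hypothetical top-3 page IDs for one query in baseline and delta.
baseline_ids = [101, 102, 103]
delta_ids = [102, 101, 103]  # top two results swapped

sorted_diff = baseline_ids != delta_ids              # True: order matters
unsorted_diff = set(baseline_ids) != set(delta_ids)  # False: same set of pages
```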
+`TopNDiff` also includes an option, `showstats`, to show statistics and charts (see below) for unsorted changes. The difference between two result sets is based on the number of items that need to be added, changed, or removed to change one set into the other (similar to an unordered edit distance); see the worked example below.
+
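Concretely, the patch computes this as the size of the larger set minus the size of the intersection; a worked example (IDs made up):

```python
baseline_ids = {101, 102, 103, 104, 105}
delta_ids = {101, 102, 103, 106, 107}

# max(|x|, |y|) minus the intersection size, as in TopNDiff.has_condition.
edit_dist = max(len(baseline_ids), len(delta_ids)) - len(baseline_ids & delta_ids)
# edit_dist == 2: 104 and 105 must be swapped out for 106 and 107
```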
+It makes sense to have multiple `TopNDiff` metrics—e.g., sorted and unsorted top 3, 5, 10, and 20—since these different stats tell different stories.
+
+**Statistics and Charts:** When statistics and charts are to be displayed, the mean (μ), standard deviation (σ), and median are computed, both for the number/count of differences and for the percent differences. These can be very different or nearly identical. For example, if every query got one more result in TotalHits, that's +1 for every query; but for a query that originally had 1 result it's +100%, while for a query that had 100 results it's only +1%. For results that change from 0 (e.g., from 0 results to 5 results), the denominator used is 1 (so 0 to 5 is +500%).
+
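A sketch of the percent calculation, matching the `pct_changed` comprehension in `num_num0_pct_chart` (each entry pairs the original count with its change; the values here are made up):

```python
from __future__ import division  # true division, as in relcomp.py

magnitude = [[1, 1], [100, 1], [0, 5]]  # [original TotalHits, change]
pct_changed = [chg / orig if orig != 0 else chg for orig, chg in magnitude]
# [1.0, 0.01, 5]: +100%, +1%, and +500% (denominator of 1 when starting from 0)
```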
+Three charts are currently provided: number/count differences ("All queries, by number of changed ——"), number/count differences after dropping all 0 changes ("Changed queries, by number of changed ——"), and percent differences after dropping all 0 changes ("Changed queries, by percent of changed ——"). Since even a change affecting 40% of queries is a pretty big change, the "0 changes" part of the graph often wildly dominates the rest; dropping the zeroes effectively allows zooming in on the rest.
+
+Charts are currently generated automatically by matplotlib, and sometimes have trouble with scale and outliers. Still, it's nice to get some idea of the distribution, since the distributions of changes we see are often not normal; μ, σ, and median are useful benchmarks, but they don't tell the whole story.
+
+The charts are presented in the report scaled fairly small, in a standard order, and each is a link to the full-sized image.
+
+**Diffs and `printnum`:** For metrics that report Diffs, the Diffs section of the report gives examples of queries that show the differences in question. Each metric takes a `printnum` parameter that determines how many examples to show. By default, the parameter is set on the command line (defaulting to 20) and shared across all metrics, though it can be overridden for any particular metric. If all the instances of a diff are to be shown (e.g., because `printnum` is 20 but there are only 5 examples), they are shown in the order they appear in the corpora. If only a sample is to be shown, a random sample of size `printnum` is selected and shown in a random order.
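A minimal sketch of that selection logic (the patch imports `shuffle` from `random`; the helper name and shape here are hypothetical):

```python
from random import shuffle

def pick_examples(diffs, printnum):  # hypothetical helper, for illustration only
    """Show all diffs in corpus order if few enough; otherwise a random sample."""
    if len(diffs) <= printnum:
        return diffs
    sample = list(diffs)
    shuffle(sample)           # random order
    return sample[:printnum]  # random sample of size printnum
```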
+
## Other Tools
diff --git a/relcomp.py b/relcomp.py
index cee3148..7370322 100755
--- a/relcomp.py
+++ b/relcomp.py
@@ -21,8 +21,13 @@
# TODO:
# read info from the .ini file to get names and maybe other info
+from __future__ import division
+
import argparse
import json
+import matplotlib.pyplot as plt
+import matplotlib.ticker as tick
+import numpy
import os
import sys
import textwrap
@@ -31,6 +36,9 @@
from itertools import izip_longest
from random import shuffle
+target_path = ""
+image_dir = "images/"
+image_path = ""
class Metric(object):
"""A metric of some sort that we want to keep track of while comparing two
@@ -86,7 +94,7 @@
baseline_is = False # does baseline qualify?
delta_is = False # does delta qualify?
- if self.has_condition(baseline, delta):
+ if self.has_condition(baseline, delta, is_baseline=True):
baseline_is = True
self.baseline_count += 1
@@ -143,7 +151,7 @@
return ret_string.encode('ascii', 'xmlcharrefreplace')
elif self.printnum > 0: # diff
- ret_string = "<b>{}:</b>\n".format(self.name)
+ ret_string = "<b>{}</b>\n".format(self.name)
ret_string += toggle_string()
printed = 0
if self.printset == "random":
@@ -177,8 +185,10 @@
return ""
@abstractmethod
- def has_condition(self, x, y):
- """Return true or false on whether the condition of the metric is
satisfied."""
+ def has_condition(self, x, y, is_baseline):
+ """Return true or false on whether the condition of the metric is
satisfied.
+ Can also gather other statistics for more complex metrics here.
+ """
pass
@@ -192,7 +202,7 @@
symbols=["↓", "↑"],
printnum=printnum)
- def has_condition(self, x, y):
+ def has_condition(self, x, y, is_baseline=False):
"""Simple check: is totalHits == 0?
"""
if "totalHits" in x:
@@ -207,14 +217,24 @@
__metaclass__ = ABCMeta
- def __init__(self, topN=5, sorted=False, printnum=20):
+ def __init__(self, topN=5, sorted=False, showstats=False, printnum=20):
sortstr = "Sorted" if sorted else "Unsorted"
self.sorted = sorted
self.topN = topN
+ self.magnitude = []
+ self.showstats = showstats
super(TopNDiff, self).__init__("Top {} {} Results Differ".format(topN,
sortstr),
symmetric=True, printnum=printnum)
- def has_condition(self, x, y):
+ def results(self, what="diff"):
+ global image_path
+ ret_string = super(TopNDiff, self).results(what)
+ if what == "delta" and not self.sorted and self.showstats:
+            ret_string += num_num0_pct_chart(self.magnitude, "top{}".format(self.topN),
+ "Top {} Results".format(self.topN))
+ return ret_string
+
+ def has_condition(self, x, y, is_baseline=False):
if "totalHits" in x:
x_hits = x["totalHits"]
else:
@@ -225,34 +245,62 @@
y_hits = 0
if x_hits == 0 and y_hits == 0:
+ if not self.sorted and self.showstats:
+ self.magnitude.append([0, 0])
return 0 # no hits means no diff
x_ids = map((lambda r: r["pageId"]), x["rows"][0:self.topN])
y_ids = map((lambda r: r["pageId"]), y["rows"][0:self.topN])
- if len(x_ids) != len(y_ids):
- return 1
-
if self.sorted:
+ if len(x_ids) != len(y_ids):
+ return 1
if x_ids == y_ids:
return 0
else:
- if set(x_ids) == set(y_ids):
+ x_ids = set(x_ids)
+ y_ids = set(y_ids)
+ if self.showstats:
+ intersection = x_ids.intersection(y_ids)
+                edit_dist = max(len(x_ids), len(y_ids)) - len(intersection)
+ self.magnitude.append([len(x_ids), edit_dist])
+ if len(x_ids) != len(y_ids):
+ return 1
+ if x_ids == y_ids:
return 0
return 1
class QueryCount(Metric):
- """A count of queries in this query set."""
+ """A count of queries in this query set. Also includes stats on TotalHits
per query."""
__metaclass__ = ABCMeta
- def __init__(self):
+ def __init__(self, resultscount=True):
+ self.resultscount = resultscount
+ self.magnitude = []
super(QueryCount, self).__init__("Query Count", raw_count=True,
printnum=0)
- def has_condition(self, x, y):
+ def has_condition(self, x, y, is_baseline=False):
+ if self.resultscount and is_baseline:
+ if "totalHits" in x:
+ x_hits = x["totalHits"]
+ else:
+ x_hits = 0
+ if "totalHits" in y:
+ y_hits = y["totalHits"]
+ else:
+ y_hits = 0
+ self.magnitude.append([x_hits, y_hits-x_hits])
return not len(x) == 0
+
+ def results(self, what="diff"):
+ global image_path
+ ret_string = super(QueryCount, self).results(what)
+ if what == "delta" and self.resultscount:
+            ret_string += num_num0_pct_chart(self.magnitude, "querycount", "TotalHits")
+ return ret_string
def make_query_string(x, y):
@@ -274,8 +322,9 @@
return query_string
-def print_report(target_dir, diff_count, file1, file2, myMetrics, errors):
- report_file = open(target_dir + "report.html", "w")
+def print_report(diff_count, file1, file2, myMetrics, errors):
+ global target_path
+ report_file = open(target_path + "report.html", "w")
report_file.write(textwrap.dedent("""\
<script>
function toggle (button, span) {{
@@ -299,10 +348,10 @@
<h2>Comparison run summary: {}</h2>
<blockquote>
<b>Stats:</b> {} query pairs compared<br>
- """).format(target_dir, diff_count))
+ """).format(target_path, diff_count))
if len(errors):
- report_file.write("<br>\n<font color=red><b>QUERY PAIRS WITH ERRORS: "
+
+ report_file.write("<br>\n<font color=red><b>QUERY PAIRS WITH ERRORS " +
"{}</b></font>\n".format(len(errors)))
report_file.write(toggle_string())
printed = 0
@@ -359,6 +408,64 @@
toggle_string.num = 0
+def make_hist(data, file, title="", xlab="", ylab="", bins=0, yformat="", xformat=""):
+ plt.clf()
+ if bins:
+ plt.hist(data, bins)
+ else:
+ plt.hist(data)
+ if title:
+ plt.title(title)
+ if xlab:
+ plt.xlabel(xlab)
+ if ylab:
+ plt.ylabel(ylab)
+ axes = plt.gca()
+ if yformat == "pct":
+ vals = axes.get_yticks()
+ axes.set_yticklabels(['{:3.2f}%'.format(x*100) for x in vals])
+ else:
+ axes.get_yaxis().set_major_locator(tick.MaxNLocator(integer=True))
+ if xformat == "pct":
+ vals = axes.get_xticks()
+ axes.set_xticklabels(['{:3.2f}%'.format(x*100) for x in vals])
+ else:
+ axes.get_xaxis().set_major_locator(tick.MaxNLocator(integer=True))
+ fig = plt.gcf()
+ fig.savefig(file)
+
+
+def num_num0_pct_chart(data, file_prefix, label):
+ ret_string = ""
+ num_changed = [x[1] for x in data]
+ pct_changed = [x[1]/x[0] if x[0]!=0 else x[1] for x in data]
+ indent = " "
+ file_num0 = "{}_num0.png".format(file_prefix)
+ file_num = "{}_num.png".format(file_prefix)
+ file_pct = "{}_pct.png".format(file_prefix)
+ make_hist(num_changed, image_path + file_num0,
+ xlab="Number {} Changed".format(label), ylab="Frequency",
+ title="All queries, by number of changed {}".format(label))
+ make_hist([x for x in num_changed if x != 0], image_path + file_num,
+ xlab="Number {} Changed".format(label), ylab="Frequency",
+ title="Changed queries, by number of changed {}".format(label))
+ make_hist([x for x in pct_changed if x != 0], image_path + file_pct,
+ xlab="Percent {} Changed".format(label), ylab="Frequency",
xformat="pct",
+ title="Changed queries, by percent of changed {}".format(label))
+ ret_string += indent + "Num {} Changed: μ: ".format(label) +\
+ "{:0.2f}; σ: {:0.2f}; median: {:0.2f}<br>\n".format(
+        numpy.mean(num_changed), numpy.std(num_changed), numpy.median(num_changed))
+ ret_string += indent + "Pct {} Changed: μ: ".format(label) +\
+ "{:0.1f}%; σ: {:0.1f}%; median: {:0.1f}%<br>\n".format(
+        numpy.mean(pct_changed)*100, numpy.std(pct_changed)*100, numpy.median(pct_changed)*100)
+ ret_string += indent + "Charts " + toggle_string() + "<br>\n" +\
+ indent + "<a href='{0}'><img src='{0}'
height=125></a>".format(image_dir + file_num0) +\
+ indent + "<a href='{0}'><img src='{0}'
height=125></a>".format(image_dir + file_num) +\
+ indent + "<a href='{0}'><img src='{0}'
height=125></a>".format(image_dir + file_pct) +\
+ "</span><br>\n"
+ return ret_string
+
+
def main():
parser = argparse.ArgumentParser(
description="Generate a report comparing two relevance lab query runs",
@@ -372,11 +479,16 @@
args = parser.parse_args()
(file1, file2) = args.file
- target_dir = args.dir + "/"
+ global target_path
+ global image_path
+ target_path = args.dir + "/"
+ image_path = target_path + image_dir
printnum = int(args.printnum)
- if not os.path.exists(target_dir):
- os.makedirs(os.path.dirname(target_dir))
+ if not os.path.exists(target_path):
+ os.makedirs(os.path.dirname(target_path))
+ if not os.path.exists(image_path):
+ os.makedirs(os.path.dirname(image_path))
diff_count = 0
errors = {}
@@ -386,10 +498,12 @@
myMetrics = [
QueryCount(),
ZeroResultsRate(printnum=printnum),
- TopNDiff(3, sorted=False, printnum=printnum),
TopNDiff(3, sorted=True, printnum=printnum),
- TopNDiff(5, sorted=False, printnum=printnum),
- TopNDiff(5, sorted=True, printnum=printnum)
+ TopNDiff(3, sorted=False, printnum=printnum),
+ TopNDiff(5, sorted=True, printnum=printnum),
+ TopNDiff(5, sorted=False, printnum=printnum, showstats=True),
+ TopNDiff(20, sorted=True, printnum=printnum),
+ TopNDiff(20, sorted=False, printnum=printnum, showstats=True)
]
with open(file1) as a, open(file2) as b:
@@ -413,7 +527,7 @@
for m in myMetrics:
m.measure(ajson, bjson, diff_count)
- print_report(target_dir, diff_count, file1, file2, myMetrics, errors)
+ print_report(diff_count, file1, file2, myMetrics, errors)
if __name__ == "__main__":
--
To view, visit https://gerrit.wikimedia.org/r/277590
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings
Gerrit-MessageType: merged
Gerrit-Change-Id: I643f35755437041e1a56b159638bf72bef451376
Gerrit-PatchSet: 1
Gerrit-Project: wikimedia/discovery/relevanceForge
Gerrit-Branch: master
Gerrit-Owner: Tjones <[email protected]>
Gerrit-Reviewer: DCausse <[email protected]>
Gerrit-Reviewer: EBernhardson <[email protected]>