[MediaWiki-commits] [Gerrit] Make relForge more robust by filtering out non-JSON lines fr... - change (wikimedia...relevanceForge)

jenkins-bot (Code Review) Fri, 08 Apr 2016 06:18:08 -0700

jenkins-bot has submitted this change and it was merged.

Change subject: Make relForge more robust by filtering out non-JSON lines from 
results
......................................................................



Make relForge more robust by filtering out non-JSON lines from results

results files with non-JSON lines cause expensive engineScore.py runs
to fail mid-process. Filtering them out allows processing to continue.

Non-JSON lines are saved to results.isnotjson for inspection.

Also includes updates to README.md.

Change-Id: I697d6cfd6f63baa285db06460632b4e33b47f22a
---
M README.md
M relevancyRunner.py
2 files changed, 30 insertions(+), 11 deletions(-)

Approvals:
  DCausse: Looks good to me, approved
  jenkins-bot: Verified



diff --git a/README.md b/README.md
index 42a35df..7d9546e 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@
     /_/|_|\__/_/\__/|___/\_,_/_//_/\__/\__/ /_/  \___/_/  \_, /\__/ 
                                                          /___/
 
-The primary purpose of the Relevance Lab is to allow us<sup>†</sup> to 
experiment with proposed modifications to our search process and gauge their 
effectiveness<sup>‡</sup> and impact<sup>§</sup> before releasing them into 
production, and even before doing any kind of user acceptance or A/B testing. 
Also, testing in the relevance lab gives an additional benefit over A/B tests 
(esp. in the case of very targeted changes): with A/B tests we aren't 
necessarily able to test the behavior of the *same query* with two different 
configurations.
+The primary purpose of the Relevance Forge is to allow us<sup>†</sup> to 
experiment with proposed modifications to our search process and gauge their 
effectiveness<sup>‡</sup> and impact<sup>§</sup> before releasing them into 
production, and even before doing any kind of user acceptance or A/B testing. 
Also, testing in the Relevance Forge gives an additional benefit over A/B tests 
(esp. in the case of very targeted changes): with A/B tests we aren't 
necessarily able to test the behavior of the *same query* with two different 
configurations.
 
 <small>
 \* Also known as RelForge to save a few keystrokes
@@ -19,7 +19,7 @@
 ## Prerequisites
 
 * Python: There's nothing too fancy here, and it works with Python 2.7, though 
a few packages are required:
- * The packages `jsonpath-rw, numpy` and `matplotlib` are required by the main 
Rel Lab.
+ * The packages `jsonpath-rw, numpy` and `matplotlib` are required by the main 
Rel Forge.
  * The package `termcolor` is required by the Cirrus Query Debugger.
  * The package `scipy` is required by the Engine Score Optimizer
  * The package `matplotlib` is required by the Engine Score Optimizer
@@ -28,23 +28,23 @@
 
 ## Invocation
 
-The main Rel Lab process is `relevancyRunner.py`, which takes a `.ini` config 
file (see below):
+The main Rel Forge process is `relevancyRunner.py`, which takes a `.ini` 
config file (see below):
 
         relevancyRunner.py -c relevance.ini
 
 ### Processes
 
-`relevancyRunner.py` parses the `.ini` file (see below), manages 
configuration, runs the queries against the Elasticsearch cluster and outputs 
the results, and then delegates diffing the results to the `jsonDiffTool` 
specified in the `.ini` file, and delegated the final report to the 
`metricTool` specified in the `.ini` file. It also archives the original 
queries and configuration (`.ini` and JSON `config` files) with the Rel Lab run 
output.
+`relevancyRunner.py` parses the `.ini` file (see below), manages 
configuration, runs the queries against the Elasticsearch cluster and outputs 
the results, and then delegates diffing the results to the `jsonDiffTool` 
specified in the `.ini` file, and delegated the final report to the 
`metricTool` specified in the `.ini` file. It also archives the original 
queries and configuration (`.ini` and JSON `config` files) with the Rel Forge 
run output.
 
 The `jsonDiffTool` is implemented as `jsondiff.py`, "an almost smart enough 
JSON diff tool". It's actually not that smart: it munges the search results 
JSON a bit, pretty-prints it, and then uses Python's HtmlDiff to make 
reasonably pretty output.
 
-The `metricTool` is implemented as `relcomp.py`, which generates an HTML 
report comparing two relevance lab query runs. A number of metrics are defined, 
including zero results rate and a generic top-N diffs (sorted or not). Adding 
and configuring these metrics can be done in `main`, in the array `myMetrics`. 
Examples of queries that change from one run to the next for each metric are 
provided, with links into the diffs created by `jsondiff.py`.
+The `metricTool` is implemented as `relcomp.py`, which generates an HTML 
report comparing two Relevance Forge query runs. A number of metrics are 
defined, including zero results rate and a generic top-N diffs (sorted or not). 
Adding and configuring these metrics can be done in `main`, in the array 
`myMetrics`. Examples of queries that change from one run to the next for each 
metric are provided, with links into the diffs created by `jsondiff.py`.
 
 Running the queries is typically the most time-consuming part of the process. 
If you ask for a very large number of results for each query (≫100), the diff 
step can be very slow. The report processing is generally very quick.
 
 ### Configuration
 
-The Rel Lab is configured by way of an .ini file. A sample, `relevance.ini`, 
is provided. Global settings are provided in `[settings]`, and config for the 
two test runs are in `[test1]` and `[test2]`.
+The Rel Forge is configured by way of an .ini file. A sample, `relevance.ini`, 
is provided. Global settings are provided in `[settings]`, and config for the 
two test runs are in `[test1]` and `[test2]`.
 
 Additional command line arguments can be added to `searchCommand` to affect 
the way the queries are run (such as what wiki to run against, changing the 
number of results returned, and including detailed scoring information.
 
@@ -78,9 +78,9 @@
 
 ## Output
 
-By default, Rel Lab run results are written out to the `relevance/` directory. 
This can be configured under `workDir` under `[settings]` in the `.ini` file.
+By default, Rel Forge run results are written out to the `relevance/` 
directory. This can be configured under `workDir` under `[settings]` in the 
`.ini` file.
 
-A directory for each query set is created in the `relevance/queries/` 
directory. The directory is a "safe" version of the `name` given under 
`[test#]`. This directory contains the queries, the results, and a copy of the 
JSON config file used, if any, under the name `config.json`.
+A directory for each query set is created in the `relevance/queries/` 
directory. The directory is a "safe" version of the `name` given under 
`[test#]`. This directory contains the `queries`, the `results`, and a copy of 
the JSON config file used, if any, under the name `config.json`. If `results` 
contains non-JSON lines, these are filtered out to `results.isnotjson` for 
inspection.
 
 A directory for each comparison between `[test1]` and `[test2]` is created un 
the `relevance/comparisons/` directory. The name is a concatenation of the 
"safe" versions of the `name`s given to the query sets. The original `.ini` 
file is copied to `config.ini`, the final report is in `report.html`, and the 
diffs are stored in the `diffs/` directory, and are named in order as 
`diff#.html`.
 
@@ -111,7 +111,7 @@
 
 ## Other Tools
 
-There are a few other bits and bobs included with the Rel Lab.
+There are a few other bits and bobs included with the Rel Forge.
 
 ### Cirrus Query Debugger
 
@@ -145,7 +145,7 @@
 
 ### Import Indices
 
-Import Indices (`importindices.py`) downloads Elasticsearch indices from 
wikimedia dumps and imports them to an Elasticsearch cluster. It lives with the 
Rel Lab but is used on the Elasticsearch server you connect to, not your local 
machine.
+Import Indices (`importindices.py`) downloads Elasticsearch indices from 
wikimedia dumps and imports them to an Elasticsearch cluster. It lives with the 
Rel Forge but is used on the Elasticsearch server you connect to, not your 
local machine.
 
 ### Piecewise Linear Model of an Empirical Distribution Function
 
@@ -199,7 +199,7 @@
 
 ### Gerrit Config
 
-These files help Gerrit process patches correctly and are not directly part of 
the Rel Lab:
+These files help Gerrit process patches correctly and are not directly part of 
the Rel Forge:
 
 * `setup.cfg`
 * `tox.ini`
diff --git a/relevancyRunner.py b/relevancyRunner.py
index a2bcba3..1efff4d 100755
--- a/relevancyRunner.py
+++ b/relevancyRunner.py
@@ -41,6 +41,23 @@
     return '%s/%s/%s' % (config.get('settings', 'workdir'), subdir, qname)
 
 
+def sanitize_json(file):
+    file_isjson = file + '.isjson'
+    file_isnotjson = file + '.isnotjson'
+    isjson = open(file_isjson, 'w')
+    isnotjson = open(file_isnotjson, 'w')
+    for line in open(file).readlines():
+        if line.startswith('{'):
+            isjson.write(line)
+        else:
+            isnotjson.write(line)
+    isjson.close()
+    isnotjson.close()
+    shutil.move(file_isjson, file)
+    if os.path.getsize(file_isnotjson) == 0:
+        os.remove(file_isnotjson)
+
+
 def runSearch(config, section):
     qdir = getSafeWorkPath(config, section, 'queries')
     cmdline = config.get(section, 'searchCommand')
@@ -65,6 +82,8 @@
                                             pipes.quote(cmdline),
                                             results_file))
     shutil.copyfile(config.get(section, 'queries'), qdir + '/queries')  # 
archive queries
+    # sanitize json
+    sanitize_json(results_file)
     return results_file
 
 

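Note on the filter above: `sanitize_json` treats any line that starts with `{` as JSON, so a truncated or otherwise malformed line that happens to begin with a brace would still reach `engineScore.py`. The following is a minimal sketch of a stricter variant that parses each line before keeping it. It is not part of the merged change; the names `is_json_object` and `sanitize_json_strict` are hypothetical, and the sketch assumes that every valid results line is a single JSON object.

```python
import json
import os
import shutil


def is_json_object(line):
    """Return True if the line parses as a JSON object (hypothetical helper)."""
    try:
        return isinstance(json.loads(line), dict)
    except ValueError:
        # json.loads signals malformed input with ValueError (or a subclass).
        return False


def sanitize_json_strict(path):
    """Keep only JSON-object lines in `path`; write the rest to `path`.isnotjson.

    Mirrors the file layout used by sanitize_json in the diff above.
    Returns the number of rejected lines.
    """
    keep_path = path + '.isjson'
    reject_path = path + '.isnotjson'
    rejected = 0
    with open(path) as src, \
            open(keep_path, 'w') as keep, \
            open(reject_path, 'w') as reject:
        for line in src:  # stream the file instead of loading it whole
            if is_json_object(line):
                keep.write(line)
            else:
                reject.write(line)
                rejected += 1
    # Replace the original results file with the filtered copy.
    shutil.move(keep_path, path)
    # Do not leave an empty .isnotjson file behind.
    if rejected == 0:
        os.remove(reject_path)
    return rejected
```

Parsing every line costs more than a prefix check, so whether the stricter test is worthwhile depends on how large the results files are and how often malformed lines starting with `{` actually occur.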
-- 
To view, visit https://gerrit.wikimedia.org/r/281023
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I697d6cfd6f63baa285db06460632b4e33b47f22a
Gerrit-PatchSet: 1
Gerrit-Project: wikimedia/discovery/relevanceForge
Gerrit-Branch: master
Gerrit-Owner: Tjones <[email protected]>
Gerrit-Reviewer: DCausse <[email protected]>
Gerrit-Reviewer: EBernhardson <[email protected]>
Gerrit-Reviewer: jenkins-bot <>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
