jenkins-bot has submitted this change and it was merged.
Change subject: Make relForge more robust by filtering out non-JSON lines from
results
......................................................................
Make relForge more robust by filtering out non-JSON lines from results
results files with non-JSON lines cause expensive engineScore.py runs
to fail mid-process. Filtering them out allows processing to continue.
Non-JSON lines are saved to results.isnotjson for inspection.
Also includes updates to README.md.
Change-Id: I697d6cfd6f63baa285db06460632b4e33b47f22a
---
M README.md
M relevancyRunner.py
2 files changed, 30 insertions(+), 11 deletions(-)
Approvals:
DCausse: Looks good to me, approved
jenkins-bot: Verified
diff --git a/README.md b/README.md
index 42a35df..7d9546e 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@
/_/|_|\__/_/\__/|___/\_,_/_//_/\__/\__/ /_/ \___/_/ \_, /\__/
/___/
-The primary purpose of the Relevance Lab is to allow us<sup>†</sup> to
experiment with proposed modifications to our search process and gauge their
effectiveness<sup>‡</sup> and impact<sup>§</sup> before releasing them into
production, and even before doing any kind of user acceptance or A/B testing.
Also, testing in the relevance lab gives an additional benefit over A/B tests
(esp. in the case of very targeted changes): with A/B tests we aren't
necessarily able to test the behavior of the *same query* with two different
configurations.
+The primary purpose of the Relevance Forge is to allow us<sup>†</sup> to
experiment with proposed modifications to our search process and gauge their
effectiveness<sup>‡</sup> and impact<sup>§</sup> before releasing them into
production, and even before doing any kind of user acceptance or A/B testing.
Also, testing in the Relevance Forge gives an additional benefit over A/B tests
(esp. in the case of very targeted changes): with A/B tests we aren't
necessarily able to test the behavior of the *same query* with two different
configurations.
<small>
\* Also known as RelForge to save a few keystrokes
@@ -19,7 +19,7 @@
## Prerequisites
* Python: There's nothing too fancy here, and it works with Python 2.7, though
a few packages are required:
- * The packages `jsonpath-rw, numpy` and `matplotlib` are required by the main
Rel Lab.
+ * The packages `jsonpath-rw, numpy` and `matplotlib` are required by the main
Rel Forge.
* The package `termcolor` is required by the Cirrus Query Debugger.
* The package `scipy` is required by the Engine Score Optimizer
* The package `matplotlib` is required by the Engine Score Optimizer
@@ -28,23 +28,23 @@
## Invocation
-The main Rel Lab process is `relevancyRunner.py`, which takes a `.ini` config
file (see below):
+The main Rel Forge process is `relevancyRunner.py`, which takes a `.ini`
config file (see below):
relevancyRunner.py -c relevance.ini
### Processes
-`relevancyRunner.py` parses the `.ini` file (see below), manages
configuration, runs the queries against the Elasticsearch cluster and outputs
the results, and then delegates diffing the results to the `jsonDiffTool`
specified in the `.ini` file, and delegated the final report to the
`metricTool` specified in the `.ini` file. It also archives the original
queries and configuration (`.ini` and JSON `config` files) with the Rel Lab run
output.
+`relevancyRunner.py` parses the `.ini` file (see below), manages
configuration, runs the queries against the Elasticsearch cluster and outputs
the results, and then delegates diffing the results to the `jsonDiffTool`
specified in the `.ini` file, and delegated the final report to the
`metricTool` specified in the `.ini` file. It also archives the original
queries and configuration (`.ini` and JSON `config` files) with the Rel Forge
run output.
The `jsonDiffTool` is implemented as `jsondiff.py`, "an almost smart enough
JSON diff tool". It's actually not that smart: it munges the search results
JSON a bit, pretty-prints it, and then uses Python's HtmlDiff to make
reasonably pretty output.
-The `metricTool` is implemented as `relcomp.py`, which generates an HTML
report comparing two relevance lab query runs. A number of metrics are defined,
including zero results rate and a generic top-N diffs (sorted or not). Adding
and configuring these metrics can be done in `main`, in the array `myMetrics`.
Examples of queries that change from one run to the next for each metric are
provided, with links into the diffs created by `jsondiff.py`.
+The `metricTool` is implemented as `relcomp.py`, which generates an HTML
report comparing two Relevance Forge query runs. A number of metrics are
defined, including zero results rate and a generic top-N diffs (sorted or not).
Adding and configuring these metrics can be done in `main`, in the array
`myMetrics`. Examples of queries that change from one run to the next for each
metric are provided, with links into the diffs created by `jsondiff.py`.
Running the queries is typically the most time-consuming part of the process.
If you ask for a very large number of results for each query (≫100), the diff
step can be very slow. The report processing is generally very quick.
### Configuration
-The Rel Lab is configured by way of an .ini file. A sample, `relevance.ini`,
is provided. Global settings are provided in `[settings]`, and config for the
two test runs are in `[test1]` and `[test2]`.
+The Rel Forge is configured by way of an .ini file. A sample, `relevance.ini`,
is provided. Global settings are provided in `[settings]`, and config for the
two test runs are in `[test1]` and `[test2]`.
Additional command line arguments can be added to `searchCommand` to affect
the way the queries are run (such as what wiki to run against, changing the
number of results returned, and including detailed scoring information.
@@ -78,9 +78,9 @@
## Output
-By default, Rel Lab run results are written out to the `relevance/` directory.
This can be configured under `workDir` under `[settings]` in the `.ini` file.
+By default, Rel Forge run results are written out to the `relevance/`
directory. This can be configured under `workDir` under `[settings]` in the
`.ini` file.
-A directory for each query set is created in the `relevance/queries/`
directory. The directory is a "safe" version of the `name` given under
`[test#]`. This directory contains the queries, the results, and a copy of the
JSON config file used, if any, under the name `config.json`.
+A directory for each query set is created in the `relevance/queries/`
directory. The directory is a "safe" version of the `name` given under
`[test#]`. This directory contains the `queries`, the `results`, and a copy of
the JSON config file used, if any, under the name `config.json`. If `results`
contains non-JSON lines, these are filtered out to `results.isnotjson` for
inspection.
A directory for each comparison between `[test1]` and `[test2]` is created un
the `relevance/comparisons/` directory. The name is a concatenation of the
"safe" versions of the `name`s given to the query sets. The original `.ini`
file is copied to `config.ini`, the final report is in `report.html`, and the
diffs are stored in the `diffs/` directory, and are named in order as
`diff#.html`.
@@ -111,7 +111,7 @@
## Other Tools
-There are a few other bits and bobs included with the Rel Lab.
+There are a few other bits and bobs included with the Rel Forge.
### Cirrus Query Debugger
@@ -145,7 +145,7 @@
### Import Indices
-Import Indices (`importindices.py`) downloads Elasticsearch indices from
wikimedia dumps and imports them to an Elasticsearch cluster. It lives with the
Rel Lab but is used on the Elasticsearch server you connect to, not your local
machine.
+Import Indices (`importindices.py`) downloads Elasticsearch indices from
wikimedia dumps and imports them to an Elasticsearch cluster. It lives with the
Rel Forge but is used on the Elasticsearch server you connect to, not your
local machine.
### Piecewise Linear Model of an Empirical Distribution Function
@@ -199,7 +199,7 @@
### Gerrit Config
-These files help Gerrit process patches correctly and are not directly part of
the Rel Lab:
+These files help Gerrit process patches correctly and are not directly part of
the Rel Forge:
* `setup.cfg`
* `tox.ini`
diff --git a/relevancyRunner.py b/relevancyRunner.py
index a2bcba3..1efff4d 100755
--- a/relevancyRunner.py
+++ b/relevancyRunner.py
@@ -41,6 +41,23 @@
     return '%s/%s/%s' % (config.get('settings', 'workdir'), subdir, qname)
+def sanitize_json(file):
+    file_isjson = file + '.isjson'
+    file_isnotjson = file + '.isnotjson'
+    isjson = open(file_isjson, 'w')
+    isnotjson = open(file_isnotjson, 'w')
+    for line in open(file).readlines():
+        if line.startswith('{'):
+            isjson.write(line)
+        else:
+            isnotjson.write(line)
+    isjson.close()
+    isnotjson.close()
+    shutil.move(file_isjson, file)
+    if os.path.getsize(file_isnotjson) == 0:
+        os.remove(file_isnotjson)
+
+
 def runSearch(config, section):
     qdir = getSafeWorkPath(config, section, 'queries')
     cmdline = config.get(section, 'searchCommand')
@@ -65,6 +82,8 @@
pipes.quote(cmdline),
results_file))
     shutil.copyfile(config.get(section, 'queries'), qdir + '/queries')  # archive queries
+    # sanitize json
+    sanitize_json(results_file)
     return results_file
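
Note on the filter: the prefix check in sanitize_json accepts any line that starts with '{', so a truncated or otherwise malformed line would still pass and could trip up a downstream JSON parser. What follows is a minimal sketch of a stricter variant and is not part of this change. The name sanitize_json_strict and its return value are hypothetical, and it assumes every valid result is a single-line JSON object.

```python
import json
import os
import shutil


def sanitize_json_strict(path):
    """Keep lines that parse as JSON objects; move the rest to path + '.isnotjson'.

    Hypothetical variant of sanitize_json. Returns the number of rejected lines.
    """
    kept_path = path + '.isjson'
    rejected_path = path + '.isnotjson'
    rejected = 0
    with open(path) as src, open(kept_path, 'w') as kept, \
            open(rejected_path, 'w') as bad:
        for line in src:
            try:
                # A full parse catches truncated lines that a prefix test misses.
                is_object = isinstance(json.loads(line), dict)
            except ValueError:  # json.JSONDecodeError is a subclass of ValueError
                is_object = False
            if is_object:
                kept.write(line)
            else:
                bad.write(line)
                rejected += 1
    # Replace the original file with the filtered copy, as sanitize_json does.
    shutil.move(kept_path, path)
    # Only keep the reject file when there is something to inspect.
    if rejected == 0:
        os.remove(rejected_path)
    return rejected
```

The trade-off is one full parse per line instead of a one-character check, which may matter for very large results files. The with block also closes all three files if an exception interrupts the loop.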
--
To view, visit https://gerrit.wikimedia.org/r/281023
Gerrit-MessageType: merged
Gerrit-Change-Id: I697d6cfd6f63baa285db06460632b4e33b47f22a
Gerrit-PatchSet: 1
Gerrit-Project: wikimedia/discovery/relevanceForge
Gerrit-Branch: master
Gerrit-Owner: Tjones <[email protected]>
Gerrit-Reviewer: DCausse <[email protected]>
Gerrit-Reviewer: EBernhardson <[email protected]>
Gerrit-Reviewer: jenkins-bot <>