[MediaWiki-commits] [Gerrit] Calculate an engine score - change (wikimedia...relevanceForge)

2016-03-24 Thread Tjones (Code Review)
Tjones has submitted this change and it was merged.

Change subject: Calculate an engine score
..


Calculate an engine score

This is based on Paul Nelson's talk on search relevance. It sources
query and click logs from the TestSearchSatisfaction schema, runs the
queries, and then generates an engine score and a histogram of the
result positions.

Includes a basic integration with scipy.optimize to do a brute force
search over a grid. Graphs the result when optimizing a 1 or 2
dimensional search space.

Also includes an nDCG implementation, but we don't yet have a data
source for the relevance scores that nDCG needs.
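
For reference, the standard nDCG calculation looks roughly like this (a minimal sketch only; the function names and the use of graded relevance labels are illustrative, not the code in this change):

    import math

    def dcg(relevances):
        # Discounted cumulative gain: rel_i / log2(i + 1) with positions i
        # counted from 1 (pos here is 0-based, hence pos + 2).
        return sum(rel / math.log(pos + 2, 2)
                   for pos, rel in enumerate(relevances))

    def ndcg(relevances):
        # Normalize by the DCG of the ideal (descending) ordering.
        ideal = dcg(sorted(relevances, reverse=True))
        return dcg(relevances) / ideal if ideal > 0 else 0.0

    # Example: relevance labels for the top five results of one query.
    print(ndcg([3, 2, 3, 0, 1]))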

Change-Id: I167d29e934b7f048e120f908a28e883a77d61c00
---
M .gitignore
M README.md
A engineScore.ini
A engineScore.py
M relevancyRunner.py
A sql/extract_query_and_click_logs.all_clicks.sql
A sql/extract_query_and_click_logs.sat_clicks.sql
7 files changed, 516 insertions(+), 28 deletions(-)

Approvals:
  Tjones: Verified; Looks good to me, approved



diff --git a/.gitignore b/.gitignore
index 09ce239..28d03ca 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,4 @@
 relevance/
 .tox/
 *.sw?
+*.pyc
diff --git a/README.md b/README.md
index b4c9f53..42a35df 100644
--- a/README.md
+++ b/README.md
@@ -21,6 +21,8 @@
 * Python: There's nothing too fancy here, and it works with Python 2.7, though a few packages are required:
 * The packages `jsonpath-rw`, `numpy` and `matplotlib` are required by the main Rel Lab.
 * The package `termcolor` is required by the Cirrus Query Debugger.
+ * The package `scipy` is required by the Engine Score Optimizer.
+ * The package `matplotlib` is required by the Engine Score Optimizer.
 * If you don't have one of these packages, you can get it with `pip install <package>` (`sudo` may be required to install packages).
 * SSH access to the host you intend to connect to.
 
@@ -181,6 +183,14 @@
 
 Segment specification: The segment specification is an array of tuples, `[x, y, slope]`. `(x,y)` is the end of one line segment and the beginning of the next (except for the first and last tuple, naturally), and `slope` is the slope of the line segment from the previous endpoint (used in the generic and custom functions to save on re-computing (y[i]-y[i-1])/(x[i]-x[i-1]) for every single evaluation). The first point is given a `slope` of 0, though it isn't used.
 
+### Engine Scoring Optimizer
+
+The Engine Scoring Optimizer (`engineScore.py`) generates a single number representing the score of the engine for the query set. It can combine this calculation with scipy brute-force optimization to explore a multi-dimensional space of numeric config values and attempt to find the best values. It works similarly to the main relevancy runner, which is reused here to run the queries.
+
+The Engine Scoring process takes an `.ini` file similar to the main relevancy runner's:
+
+engineScore.py -c engineScore.ini
+
 ### Miscellaneous
 
 The `misc/` directory contains additional useful stuff:
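
To make the `factor` setting in the config below concrete: following Paul Nelson's engine-score idea, each click contributes more the closer it is to the top of the result list, and the per-session scores are averaged over the query set. This is only a sketch of that shape; the exact formula in engineScore.py may differ:

    def session_score(click_positions, factor=0.9):
        # Each clicked result contributes factor**position (0-based), so
        # clicks near the top are worth more; a factor closer to 1 puts
        # less weight on position, as the ini comment describes.
        if not click_positions:
            return 0.0
        return sum(factor ** pos for pos in click_positions) / len(click_positions)

    def engine_score(sessions_click_positions, factor=0.9):
        # Average the per-session scores over all sessions in the query set.
        scores = [session_score(s, factor) for s in sessions_click_positions]
        return sum(scores) / len(scores)

    print(engine_score([[0, 2], [1], [5]], factor=0.9))
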
diff --git a/engineScore.ini b/engineScore.ini
new file mode 100644
index 000..41a74ab
--- /dev/null
+++ b/engineScore.ini
@@ -0,0 +1,41 @@
+[settings]
+stats_server = stat1002.eqiad.wmnet
+mysql_options = --defaults-extra-file=/etc/mysql/conf.d/analytics-research-client.cnf --host dbstore1002.eqiad.wmnet
+; This query will be formatted with python's string.Formatter using the variables from within this settings group
+query = ./sql/extract_query_and_click_logs.sat_clicks.sql
+; Controls the importance of result position within the set. Must be between 0 and 1 exclusive.
+; Higher values place less weight on the position. 0.9 seems reasonable for a 20-item result set.
+factor = .9
+workDir = ./relevance
+; The rest of the variables here are used within the sql query
+date_start = 2016022200
+date_end = 2016022900
+dwell_threshold = 10
+wiki = enwiki
+num_sessions = 2000
+
+[test1]
+name = I need to be better at naming things...tfidf-visitPage
+config = ./conf.json
+labHost = suggesty.eqiad.wmflabs
+searchCommand = sudo -u www-data hhvm /var/www/w/MWScript.php extensions/CirrusSearch/maintenance/runSearch.php --baseName enwiki --limit 20 --cluster nobelium --fork 8
+;labHost = searchdemo.eqiad.wmflabs
+;searchCommand = cd /srv/mediawiki-vagrant && mwvagrant ssh -- mwscript extensions/CirrusSearch/maintenance/runSearch.php --baseName enwiki --fork 12 --limit 20 --cluster labsearch
+
+; brute force optimization. requires scipy and matplotlib. This section
+; may be omitted to calculate the engine score a single time.
+[optimize]
+; Set the minimum and maximum values of the variables
+bounds = [[1, 128], [64, 1024]]
+; Each pair in bounds will be expanded into Ns evenly spaced points
+; between its low and high values, inclusive.
+Ns = 3
+; display graph of the results? A result graph will be written out to
+; {workDir}/optimize/{name} regardless.
+plot = true
+; variables are defined by bounds and named
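
The [optimize] section above maps fairly directly onto scipy's grid search. A sketch of that wiring, with a placeholder objective standing in for a full relevancy run (scipy.optimize.brute minimizes, so the engine score is negated):

    from scipy import optimize

    # bounds and Ns as in the [optimize] section: each variable is sampled
    # at Ns evenly spaced points between its low and high value, inclusive.
    bounds = [(1, 128), (64, 1024)]
    Ns = 3

    def objective(point):
        # Placeholder: a real run would write `point` into the search
        # config, re-run the query set, and return the negated engine score.
        x, y = point
        return -1.0 / (1.0 + abs(x - 64) + abs(y - 512))

    best, best_value, grid, scores = optimize.brute(
        objective, bounds, Ns=Ns, full_output=True, finish=None)
    print('best point: %s, score: %.4f' % (best, -best_value))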

[MediaWiki-commits] [Gerrit] Calculate an engine score - change (wikimedia...relevanceForge)

2016-03-04 Thread EBernhardson (Code Review)
EBernhardson has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/275158

Change subject: Calculate an engine score
..

Calculate an engine score

This is based on Paul Nelson's talk on search relevance. It sources
query and click logs from the TestSearchSatisfaction schema, runs the
queries, and then generates an engine score and a histogram of the
result positions.
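
As a rough illustration of the "histogram of the result positions" mentioned above, using the session and result structures that extract_sessions() and load_results() build in the diff below (the patch's own Histogram class is not shown here, so this is only an assumed shape):

    from collections import Counter

    def position_histogram(sessions, results):
        # Count, for each clicked article, where it appeared in the results
        # returned for the session's queries; -1 means the clicked article
        # was not in the returned results at all. Assumes the page ids from
        # the click logs and the search results use the same type.
        histogram = Counter()
        for session in sessions.values():
            for query in session['queries']:
                hits = results.get(query, [])
                for articleId in session['clicks']:
                    if articleId in hits:
                        histogram[hits.index(articleId)] += 1
                    else:
                        histogram[-1] += 1
        return histogram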

Change-Id: I167d29e934b7f048e120f908a28e883a77d61c00
---
M .gitignore
A cache/.gitkeep
A engineScore.ini
A engineScore.py
M relevancyRunner.py
5 files changed, 257 insertions(+), 34 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/wikimedia/discovery/relevanceForge refs/changes/58/275158/1

diff --git a/.gitignore b/.gitignore
index 09ce239..1ffc9ea 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,4 @@
 relevance/
 .tox/
 *.sw?
+cache/.gitkeep
diff --git a/cache/.gitkeep b/cache/.gitkeep
new file mode 100644
index 000..e69de29
--- /dev/null
+++ b/cache/.gitkeep
diff --git a/engineScore.ini b/engineScore.ini
new file mode 100644
index 000..c3ea8a7
--- /dev/null
+++ b/engineScore.ini
@@ -0,0 +1,17 @@
+[settings]
+stats_server = stat1002.eqiad.wmnet
+mysql_options = --defaults-extra-file=/etc/mysql/conf.d/analytics-research-client.cnf --host dbstore1002.eqiad.wmnet
+date_start = 2016022200
+date_end = 2016022900
+dwell_threshold = 12
+wiki = enwiki
+num_sessions = 2000
+click_log_cache_dir = ./cache/
+factor = .9
+workDir = ./relevance
+
+[test1]
+name = I need to be better at naming things...
+labHost = searchdemo.eqiad.wmflabs
+searchCommand = cd /srv/mediawiki-vagrant && mwvagrant ssh -- mwscript extensions/CirrusSearch/maintenance/runSearch.php --baseName enwiki --fork 2 --limit 20
+allow_reuse = 1
diff --git a/engineScore.py b/engineScore.py
new file mode 100644
index 000..5e5a7e2
--- /dev/null
+++ b/engineScore.py
@@ -0,0 +1,192 @@
+import os
+import sys
+import json
+import argparse
+import md5
+import tempfile
+import ConfigParser
+import subprocess
+import relevancyRunner
+
+verbose = False
+def debug(string):
+    if verbose:
+        print(string)
+
+def fetch_query_and_click_logs(settings):
+    with open('sql/extract_query_and_click_logs.sql') as f:
+        queryFormat = f.read()
+
+    query = queryFormat.format(**{
+        'date_start': settings('date_start'),
+        'date_end': settings('date_end'),
+        'dwell_threshold': settings('dwell_threshold'),
+        'wiki': settings('wiki'),
+        'limit': settings('num_sessions'),
+    })
+
+    # Cache the log extract keyed on a hash of the generated query
+    m = md5.new()
+    m.update(query)
+    hash = m.hexdigest()
+    cache_path = settings('click_log_cache_dir') + '/click_log.' + hash
+    try:
+        with open(cache_path, 'r') as f:
+            return f.read().split("\n")
+    except IOError:
+        pass
+
+    # Run the query through mysql on the stats server over ssh
+    p = subprocess.Popen(['ssh', settings('stats_server'),
+                          'mysql ' + settings('mysql_options')],
+                         stdin=subprocess.PIPE, stdout=subprocess.PIPE,
+                         stderr=subprocess.PIPE)
+
+    stdout, stderr = p.communicate(input=query)
+    if len(stdout) == 0:
+        raise RuntimeError("Could not run SQL query:\n%s" % (stderr))
+
+    with open(cache_path, 'w') as f:
+        f.write(stdout)
+    return stdout.split("\n")
+
+def extract_sessions(sql_result):
+    # Discard the mysql header row, then group clicks and queries by session
+    header = sql_result.pop(0)
+    sessions = {}
+    queries = set()
+    clicks = 0
+    for line in sql_result:
+        if len(line) == 0:
+            continue
+        sessionId, articleId, query = line.split("\t", 2)
+        if sessionId in sessions:
+            session = sessions[sessionId]
+        else:
+            session = {
+                'queries': set([]),
+                'clicks': set([]),
+            }
+            sessions[sessionId] = session
+
+        if not articleId == 'NULL':
+            session['clicks'].update([articleId])
+            clicks += 1
+        if not query == 'NULL':
+            session['queries'].update([query])
+            queries.update([query])
+
+    print('Loaded %d sessions with %d clicks and %d unique queries' %
+          (len(sessions), clicks, len(queries)))
+
+    return (sessions, queries)
+
+def genSettings(config):
+    def get(key):
+        return config.get('settings', key)
+    return get
+
+def load_results(results_path):
+    # Load the results
+    results = {}
+    with open(results_path) as f:
+        for line in f:
+            decoded = json.loads(line)
+            hits = []
+            if 'error' not in decoded:
+                for hit in decoded['rows']:
+                    hits.append(hit['pageId'])
+            results[decoded['query']] = hits
+    return results
+
+def calc_engine_score(sessions, results, factor):
+    # Calculate a score per session
+    session_score = 0.
+    histogram = Histogram()
+    for sessionId in sessions:
+        debug(sessionId)
+        session = sessions[sessionId]
+