[MediaWiki-commits] [Gerrit] Calculate an engine score - change (wikimedia...relevanceForge)
Tjones has submitted this change and it was merged.

Change subject: Calculate an engine score
......................................................................

Calculate an engine score

This is based on Paul Nelson's talk on search relevance. It sources
query and click logs from the TestSearchSatisfaction schema, runs the
queries, and then generates an engine score and a histogram of the
result positions.

Includes a basic integration with scipy.optimize to do a brute force
search over a grid. Graphs the result when optimizing a 1 or 2
dimensional search space.

Also includes an nDCG implementation, but we don't yet have a source
of data to get the scores to use with nDCG.

Change-Id: I167d29e934b7f048e120f908a28e883a77d61c00
---
M .gitignore
M README.md
A engineScore.ini
A engineScore.py
M relevancyRunner.py
A sql/extract_query_and_click_logs.all_clicks.sql
A sql/extract_query_and_click_logs.sat_clicks.sql
7 files changed, 516 insertions(+), 28 deletions(-)

Approvals:
  Tjones: Verified; Looks good to me, approved

diff --git a/.gitignore b/.gitignore
index 09ce239..28d03ca 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,4 @@
 relevance/
 .tox/
 *.sw?
+*.pyc
diff --git a/README.md b/README.md
index b4c9f53..42a35df 100644
--- a/README.md
+++ b/README.md
@@ -21,6 +21,8 @@
 * Python: There's nothing too fancy here, and it works with Python 2.7, though a few packages are required:
     * The packages `jsonpath-rw, numpy` and `matplotlib` are required by the main Rel Lab.
     * The package `termcolor` is required by the Cirrus Query Debugger.
+    * The package `scipy` is required by the Engine Score Optimizer
+    * The package `matplotlib` is required by the Engine Score Optimizer
     * If you don't have one of these packages, you can get it with `pip install ` (`sudo` may be required to install packages).
 * SSH access to the host you intend to connect to.
@@ -181,6 +183,14 @@
 Segment specification: The segment specification is an array of tuples, `[x, y, slope]`. `(x,y)` is the end of one line segment and the beginning of the next (except for the first and last tuple, naturally), and `slope` is the slope of the line segment from the previous endpoint (used in the generic and custom functions to save on re-computing `(y[i]-y[i-1])/(x[i]-x[i-1])` for every single evaluation). The first point is given a `slope` of 0, though it isn't used.
+
+### Engine Scoring Optimizer
+
+The Engine Scoring Optimizer (engineScore.py) generates a single number representing the score of the engine for the query set. It can combine this calculation with scipy brute force optimization to explore a multi-dimensional space of numeric config values and attempt to find the best values. This works similarly to the main relevancy runner, which is reused here for running the queries.
+
+The Engine Scoring process takes an `.ini` file similar to the main relevancy runner:
+
+    engineScore.py -c engineScore.ini
+
 ### Miscellaneous

 The `misc/` directory contains additional useful stuff:
diff --git a/engineScore.ini b/engineScore.ini
new file mode 100644
index 000..41a74ab
--- /dev/null
+++ b/engineScore.ini
@@ -0,0 +1,41 @@
+[settings]
+stats_server = stat1002.eqiad.wmnet
+mysql_options = --defaults-extra-file=/etc/mysql/conf.d/analytics-research-client.cnf --host dbstore1002.eqiad.wmnet
+; This query will be formatted with python's string.Formatter using the variables from within this settings group
+query = ./sql/extract_query_and_click_logs.sat_clicks.sql
+; controls the importance of result position within the set. Must be between 0 and 1 exclusive. Higher values
+; place less weight on the position. .9 seems reasonable for a 20 item result set.
+factor = .9
+workDir = ./relevance
+; The rest of the variables here are used within the sql query
+date_start = 2016022200
+date_end = 2016022900
+dwell_threshold = 10
+wiki = enwiki
+num_sessions = 2000
+
+[test1]
+name = I need to be better at naming things...tfidf-visitPage
+config = ./conf.json
+labHost = suggesty.eqiad.wmflabs
+searchCommand = sudo -u www-data hhvm /var/www/w/MWScript.php extensions/CirrusSearch/maintenance/runSearch.php --baseName enwiki --limit 20 --cluster nobelium --fork 8
+;labHost = searchdemo.eqiad.wmflabs
+;searchCommand = cd /srv/mediawiki-vagrant && mwvagrant ssh -- mwscript extensions/CirrusSearch/maintenance/runSearch.php --baseName enwiki --fork 12 --limit 20 --cluster labsearch
+
+; brute force optimization. requires scipy and matplotlib. This section
+; may be omitted to calculate the engine score a single time.
+[optimize]
+; Set the minimum and maximum values of the variables
+bounds = [[1, 128], [64, 1024]]
+; Ns is a set of bounds which will be interpolated into Ns points
+; between the low and high values of bounds, inclusive.
+Ns = 3
+; display graph of the results? A result graph will be written out to
+; {workDir}/optimize/{name} regardless.
+plot = true
+; variables are defined by bounds and named
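For context on the `[optimize]` section in the diff above: `bounds` gives each variable's low/high range, and `Ns` is the number of evenly spaced grid points sampled per axis. A minimal sketch of how these settings map onto scipy's brute-force grid search follows; the quadratic objective here is a stand-in for illustration only, not the real engine-score run, which re-executes the whole query set per candidate config.

```python
from scipy import optimize


def objective(params):
    # Stand-in objective. Note brute() minimizes, so a real engine
    # score (where higher is better) would be returned negated.
    x, y = params
    return (x - 64.5) ** 2 + (y - 544.0) ** 2


# bounds = [[1, 128], [64, 1024]] with Ns = 3 samples each axis at
# 3 evenly spaced points between low and high, inclusive:
# x in {1, 64.5, 128}, y in {64, 544, 1024}.
best, score, grid, scores = optimize.brute(
    objective,
    ranges=((1, 128), (64, 1024)),
    Ns=3,
    full_output=True,
    finish=None,  # keep the best grid point; skip the local polish step
)
print(best)  # the grid point with the lowest objective value
```

With `finish=None` the result is always one of the sampled grid points, which matches the config-sweep use case here: each candidate must be a concrete config value that the search backend can actually run.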
[MediaWiki-commits] [Gerrit] Calculate an engine score - change (wikimedia...relevanceForge)
EBernhardson has uploaded a new change for review.

https://gerrit.wikimedia.org/r/275158

Change subject: Calculate an engine score
......................................................................

Calculate an engine score

This is based on Paul Nelson's talk on search relevance. It sources
query and click logs from the TestSearchSatisfaction schema, runs the
queries, and then generates an engine score and a histogram of the
result positions.

Change-Id: I167d29e934b7f048e120f908a28e883a77d61c00
---
M .gitignore
A cache/.gitkeep
A engineScore.ini
A engineScore.py
M relevancyRunner.py
5 files changed, 257 insertions(+), 34 deletions(-)

git pull ssh://gerrit.wikimedia.org:29418/wikimedia/discovery/relevanceForge refs/changes/58/275158/1

diff --git a/.gitignore b/.gitignore
index 09ce239..1ffc9ea 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,4 @@
 relevance/
 .tox/
 *.sw?
+cache/.gitkeep
diff --git a/cache/.gitkeep b/cache/.gitkeep
new file mode 100644
index 000..e69de29
--- /dev/null
+++ b/cache/.gitkeep
diff --git a/engineScore.ini b/engineScore.ini
new file mode 100644
index 000..c3ea8a7
--- /dev/null
+++ b/engineScore.ini
@@ -0,0 +1,17 @@
+[settings]
+stats_server = stat1002.eqiad.wmnet
+mysql_options = --defaults-extra-file=/etc/mysql/conf.d/analytics-research-client.cnf --host dbstore1002.eqiad.wmnet
+date_start = 2016022200
+date_end = 2016022900
+dwell_threshold = 12
+wiki = enwiki
+num_sessions = 2000
+click_log_cache_dir = ./cache/
+factor = .9
+workDir = ./relevance
+
+[test1]
+name = I need to be better at naming things...
+labHost = searchdemo.eqiad.wmflabs
+searchCommand = cd /srv/mediawiki-vagrant && mwvagrant ssh -- mwscript extensions/CirrusSearch/maintenance/runSearch.php --baseName enwiki --fork 2 --limit 20
+allow_reuse = 1
diff --git a/engineScore.py b/engineScore.py
new file mode 100644
index 000..5e5a7e2
--- /dev/null
+++ b/engineScore.py
@@ -0,0 +1,192 @@
+import os
+import sys
+import json
+import argparse
+import md5
+import tempfile
+import ConfigParser
+import subprocess
+import relevancyRunner
+
+verbose = False
+
+
+def debug(string):
+    if verbose:
+        print(string)
+
+
+def fetch_query_and_click_logs(settings):
+    with open('sql/extract_query_and_click_logs.sql') as f:
+        queryFormat = f.read()
+
+    query = queryFormat.format(**{
+        'date_start': settings('date_start'),
+        'date_end': settings('date_end'),
+        'dwell_threshold': settings('dwell_threshold'),
+        'wiki': settings('wiki'),
+        'limit': settings('num_sessions'),
+    })
+
+    m = md5.new()
+    m.update(query)
+    hash = m.hexdigest()
+    cache_path = settings('click_log_cache_dir') + '/click_log.' + hash
+    try:
+        with open(cache_path, 'r') as f:
+            return f.read().split("\n")
+    except IOError:
+        pass
+
+    p = subprocess.Popen(['ssh', settings('stats_server'),
+                          'mysql ' + settings('mysql_options')],
+                         stdin=subprocess.PIPE, stdout=subprocess.PIPE,
+                         stderr=subprocess.PIPE)
+
+    stdout, stderr = p.communicate(input=query)
+    if len(stdout) == 0:
+        raise RuntimeError("Could not run SQL query:\n%s" % (stderr))
+
+    with open(cache_path, 'w') as f:
+        f.write(stdout)
+    return stdout.split("\n")
+
+
+def extract_sessions(sql_result):
+    header = sql_result.pop(0)
+    sessions = {}
+    queries = set()
+    clicks = 0
+    for line in sql_result:
+        if len(line) == 0:
+            continue
+        sessionId, articleId, query = line.split("\t", 2)
+        if sessionId in sessions:
+            session = sessions[sessionId]
+        else:
+            session = {
+                'queries': set([]),
+                'clicks': set([]),
+            }
+            sessions[sessionId] = session
+
+        if not articleId == 'NULL':
+            session['clicks'].update([articleId])
+            clicks += 1
+        if not query == 'NULL':
+            session['queries'].update([query])
+            queries.update([query])
+
+    print('Loaded %d sessions with %d clicks and %d unique queries' %
+          (len(sessions), clicks, len(queries)))
+
+    return (sessions, queries)
+
+
+def genSettings(config):
+    def get(key):
+        return config.get('settings', key)
+    return get
+
+
+def load_results(results_path):
+    # Load the results
+    results = {}
+    with open(results_path) as f:
+        for line in f:
+            decoded = json.loads(line)
+            hits = []
+            if 'error' not in decoded:
+                for hit in decoded['rows']:
+                    hits.append(hit['pageId'])
+            results[decoded['query']] = hits
+    return results
+
+
+def calc_engine_score(sessions, results, factor):
+    # Calculate a score per session
+    session_score = 0.
+    histogram = Histogram()
+    for sessionId in sessions:
+        debug(sessionId)
+        session = sessions[sessionId]
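The archived message cuts off inside `calc_engine_score`, so the actual scoring loop is not shown here. Purely for illustration, here is a hedged sketch of the position-discounted idea the `factor` comment describes ("higher values place less weight on the position"); the helper names and the exact formula are assumptions, not the code from this patch. One natural reading is that a click at 0-based rank `pos` contributes `factor ** pos`, and per-session scores are averaged into a single engine-level number.

```python
def session_score(click_positions, factor=0.9):
    # Hypothetical helper: each click contributes factor ** rank
    # (0-based), so a click on the top result counts 1.0 and a click at
    # rank 19 counts 0.9 ** 19, roughly 0.135.
    if not click_positions:
        return 0.0
    return sum(factor ** pos for pos in click_positions) / len(click_positions)


def engine_score(sessions_positions, factor=0.9):
    # Hypothetical helper: average the per-session scores into one
    # engine-level number for the whole query set.
    scores = [session_score(p, factor) for p in sessions_positions]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this reading, the `factor` bound of (0, 1) exclusive makes sense: at 1 every result position would score equally, and at 0 only the top result would count at all.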