EBernhardson has uploaded a new change for review. https://gerrit.wikimedia.org/r/317019
Change subject: [WIP] Calculate click data for top queries
......................................................................

[WIP] Calculate click data for top queries

Finds queries issued more than 10 times in the aggregation period and
collects the clicks performed by users. This is the first step in
figuring out a way to judge engine relevance based on user
click-throughs. The paper for Expected Reciprocal Rank suggests that
the probability could be calculated from click logs, but I haven't
been able to find much on how people have actually done that.

Possible issues:
* Very naive grouping of queries; it would be better if we could apply
  stemming of some sort, but that is probably very difficult.

Work needed:
* Needs all of the appropriate oozie integration
* Needs to be parameterized
* Need to decide on the size of the aggregation period
* Need to define a table that the data will be stored to

Follow ups:
* Based on the literature, while click-through data is not all that
  useful for directly generating labels such as those used in nDCG or
  ERR, it can be used to generate relative grades. For example, if a
  user clicks the third link we don't know whether that result is
  better than the fourth, but it is probably better than the first and
  second, which the user saw and skipped. This gives the tuples
  (q, r3, r1) and (q, r3, r2), meaning that for query q, result 3 is
  preferred over result 1 and over result 2 (see the sketch below the
  diff).

Change-Id: I09f253849d8a1d28a3c26dc6b0f60233074d6a90
---
A oozie/top_query_clicks/top_query_clicks.hql
1 file changed, 56 insertions(+), 0 deletions(-)

git pull ssh://gerrit.wikimedia.org:29418/wikimedia/discovery/analytics refs/changes/19/317019/1

diff --git a/oozie/top_query_clicks/top_query_clicks.hql b/oozie/top_query_clicks/top_query_clicks.hql
new file mode 100644
index 0000000..649ea7d
--- /dev/null
+++ b/oozie/top_query_clicks/top_query_clicks.hql
@@ -0,0 +1,56 @@
+SELECT
+    query,
+    page_id,
+    identity,
+    ts AS click_timestamp,
+    pageview_info['project'] AS project,
+    hits
+FROM (
+    SELECT
+        query,
+        metadata.identity AS identity,
+        metadata.request_set_token AS request_set_token,
+        metadata.hits AS hits
+    FROM (
+        SELECT
+            LOWER(requests[SIZE(requests)-1].query) AS query,
+            COUNT(DISTINCT ip) AS num_searches,
+            COLLECT_LIST(NAMED_STRUCT('request_set_token', id, 'hits', hits, 'identity', identity)) AS query_meta,
+            wikiid
+        FROM
+            wmf_raw.cirrussearchrequestset
+        WHERE
+            year=2016
+            AND month=10
+            AND day=10
+            AND hour=10
+            -- TODO: map from wikiid to pageview_info['project']
+            AND wikiid='enwiki'
+            AND requests[SIZE(requests)-1].querytype = 'full_text'
+            AND source = 'web'
+        GROUP BY
+            wikiid,
+            -- would be nice if these could somehow be stemmed... but that's probably
+            -- asking a bit much of hive without writing some crazy UDFs to pull in
+            -- lucene and somehow use the same config we have in elasticsearch
+            -- examples: '2016 in film', '2016 in films'
+            LOWER(requests[SIZE(requests)-1].query)
+        HAVING
+            num_searches > 10
+    ) inner_a
+    LATERAL VIEW
+        EXPLODE(query_meta) meta AS metadata
+) inner_b
+JOIN
+    wmf.webrequest
+ON
+    year=2016
+    AND month=10
+    AND day=10
+    AND hour=10
+    AND PARSE_URL(referer, 'QUERY') IS NOT NULL
+    AND STR_TO_MAP(PARSE_URL(referer, 'QUERY'), '&', '=')['searchToken'] = request_set_token
+    AND is_pageview = TRUE
+    AND page_id IS NOT NULL
+    -- TODO: map from wikiid to pageview_info['project']
+    AND pageview_info['project'] = 'en.wikipedia'
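For reference, the ERR paper cited above defines
ERR = sum_{r=1..n} (1/r) * R_r * prod_{i<r} (1 - R_i), where R_i is
the probability that the result at rank i satisfies the user;
estimating R_i from click logs is the open question noted in the
commit message.

As a rough sketch of the follow-up idea (the table
query_result_clicks and all of its columns are hypothetical, not part
of this change; it assumes one row per result impression, with a
hit_position and a clicked flag), the Joachims-style
"click beats skipped-above" preference pairs could be generated with
a self-join along these lines:

    -- Hypothetical sketch: derive preference pairs
    -- (query, preferred page, skipped page) from per-impression click data.
    -- Assumes a table query_result_clicks(query, page_id, hit_position, clicked)
    -- with one row per result shown and clicked = TRUE on click-through.
    SELECT
        c.query,
        c.page_id AS preferred_page_id,  -- the clicked result
        s.page_id AS skipped_page_id     -- an unclicked result ranked above it
    FROM
        query_result_clicks c
    JOIN
        query_result_clicks s
    ON
        c.query = s.query
    WHERE
        c.clicked = TRUE
        AND s.clicked = FALSE
        AND s.hit_position < c.hit_position;

This emits one pair per (click, skipped-above) combination, matching
the (q, r3, r1), (q, r3, r2) example in the Follow ups section.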
To view, visit https://gerrit.wikimedia.org/r/317019
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I09f253849d8a1d28a3c26dc6b0f60233074d6a90
Gerrit-PatchSet: 1
Gerrit-Project: wikimedia/discovery/analytics
Gerrit-Branch: master
Gerrit-Owner: EBernhardson <ebernhard...@wikimedia.org>