EBernhardson has uploaded a new change for review. https://gerrit.wikimedia.org/r/317019
Change subject: [WIP] Calculate click data for top queries
......................................................................

[WIP] Calculate click data for top queries

Finds queries issued more than 10 times in the aggregation period and
collects the clicks performed by users. This is the first step in
figuring out a way to judge engine relevance based on user
click-throughs. The paper for Expected Reciprocal Rank suggests that
the probability could be calculated from click logs, but I haven't
been able to find much on how people have actually done that.

Possible issues:
* Very naive grouping of queries; it would be better if we could apply
  stemming of some sort, but that is probably very difficult.

Work needed:
* Needs all of the appropriate oozie integration
* Needs to be parameterized
* Need to decide on the size of the aggregation period
* Need to define a table that the data will be stored to

Follow ups:
* Based on the literature, while click-through data is not all that
  useful for directly generating labels such as those used in nDCG or
  ERR, it can be used to generate relative grades. For example, if a
  user clicks the third link we don't know whether that result is
  better than the fourth, but it is probably better than the first and
  second, which the user saw and skipped. This gives the tuples
  (q, r3, r1) and (q, r3, r2), meaning that for query q, result 3 is
  preferred over result 1 and over result 2 (see the sketch below the
  diff).

Change-Id: I09f253849d8a1d28a3c26dc6b0f60233074d6a90
---
A oozie/top_query_clicks/top_query_clicks.hql
1 file changed, 56 insertions(+), 0 deletions(-)

git pull ssh://gerrit.wikimedia.org:29418/wikimedia/discovery/analytics refs/changes/19/317019/1

diff --git a/oozie/top_query_clicks/top_query_clicks.hql b/oozie/top_query_clicks/top_query_clicks.hql
new file mode 100644
index 0000000..649ea7d
--- /dev/null
+++ b/oozie/top_query_clicks/top_query_clicks.hql
@@ -0,0 +1,56 @@
+SELECT
+    query,
+    page_id,
+    identity,
+    ts AS click_timestamp,
+    pageview_info['project'] AS project,
+    hits
+FROM (
+    SELECT
+        query,
+        metadata.identity AS identity,
+        metadata.request_set_token AS request_set_token,
+        metadata.hits AS hits
+    FROM (
+        SELECT
+            LOWER(requests[SIZE(requests)-1].query) AS query,
+            COUNT(DISTINCT ip) AS num_searches,
+            COLLECT_LIST(NAMED_STRUCT('request_set_token', id, 'hits', hits, 'identity', identity)) AS query_meta,
+            wikiid
+        FROM
+            wmf_raw.cirrussearchrequestset
+        WHERE
+            year=2016
+            AND month=10
+            AND day=10
+            AND hour=10
+            -- TODO: map from wikiid to pageview_info['project']
+            AND wikiid='enwiki'
+            AND requests[SIZE(requests)-1].querytype = 'full_text'
+            AND source = 'web'
+        GROUP BY
+            wikiid,
+            -- would be nice if these could somehow be stemmed... but that's probably
+            -- asking a bit much of hive without writing some crazy UDFs to pull in
+            -- lucene and somehow use the same config we have in elasticsearch
+            -- examples: '2016 in film', '2016 in films'
+            LOWER(requests[SIZE(requests)-1].query)
+        HAVING
+            num_searches > 10
+    ) inner_a
+    LATERAL VIEW
+        EXPLODE(query_meta) meta AS metadata
+) inner_b
+JOIN
+    wmf.webrequest
+ON
+    year=2016
+    AND month=10
+    AND day=10
+    AND hour=10
+    AND PARSE_URL(referer, 'QUERY') IS NOT NULL
+    AND STR_TO_MAP(PARSE_URL(referer, 'QUERY'), '&', '=')['searchToken'] = request_set_token
+    AND is_pageview = TRUE
+    AND page_id IS NOT NULL
+    -- TODO: map from wikiid to pageview_info['project']
+    AND pageview_info['project'] = 'en.wikipedia'
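For reference, the ERR paper cited above defines
ERR = sum_{r=1..n} (1/r) * R_r * prod_{i<r} (1 - R_i), where R_i is
the probability that the result at rank i satisfies the user;
estimating R_i from click logs is the open question noted in the
commit message.

As a rough sketch of the follow-up idea (the table
query_result_clicks and all of its columns are hypothetical, not part
of this change; it assumes one row per result impression, with a
hit_position and a clicked flag), the Joachims-style
"click beats skipped-above" preference pairs could be generated with
a self-join along these lines:

    -- Hypothetical sketch: derive preference pairs
    -- (query, preferred page, skipped page) from per-impression click data.
    -- Assumes a table query_result_clicks(query, page_id, hit_position, clicked)
    -- with one row per result shown and clicked = TRUE on click-through.
    SELECT
        c.query,
        c.page_id AS preferred_page_id,  -- the clicked result
        s.page_id AS skipped_page_id     -- an unclicked result ranked above it
    FROM
        query_result_clicks c
    JOIN
        query_result_clicks s
    ON
        c.query = s.query
    WHERE
        c.clicked = TRUE
        AND s.clicked = FALSE
        AND s.hit_position < c.hit_position;

This emits one pair per (click, skipped-above) combination, matching
the (q, r3, r1), (q, r3, r2) example in the Follow ups section.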
To view, visit https://gerrit.wikimedia.org/r/317019
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I09f253849d8a1d28a3c26dc6b0f60233074d6a90
Gerrit-PatchSet: 1
Gerrit-Project: wikimedia/discovery/analytics
Gerrit-Branch: master
Gerrit-Owner: EBernhardson <ebernhard...@wikimedia.org>