DCausse has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/268704

Change subject: Hive query to extract a sample query set
......................................................................

Hive query to extract a sample query set

The strategy of this query is to catch a near_match query followed by a
full_text query. It should track queries sent by desktop (mobile web?)
users when they hit enter in the top right search box.
We can't really use full_text directly: it's polluted by tons of partial
queries (mobile app type-ahead).
To filter bots, the query keeps only one query per IP per day.
It seems to work, but I'm not 100% sure.
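
A possible invocation (illustrative; writing the sample to a file is an
assumption, not part of this change):

  hive -f misc/fulltextQueriesSample.hsql > fulltextQueriesSample.txt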

Bug: T125825
Change-Id: Ice89a203186e2e3c35e0108588c18fcacb1cfbc4
---
A misc/fulltextQueriesSample.hsql
1 file changed, 80 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/wikimedia/discovery/relevancylab refs/changes/04/268704/1

diff --git a/misc/fulltextQueriesSample.hsql b/misc/fulltextQueriesSample.hsql
new file mode 100644
index 0000000..53f4730
--- /dev/null
+++ b/misc/fulltextQueriesSample.hsql
@@ -0,0 +1,80 @@
+-- Random fulltext queries sample
+--
+-- Queries useful for extracting a sample query set of fulltext queries.
+-- We use query sets that include a near_match query, which is likely
+-- run when the user hits enter in the top right search box on the
+-- desktop site. If such a query set also includes a full_text query
+-- then it's probable that the user was redirected to a search results
+-- page. We can't really use full_text directly since the WikipediaApp
+-- sends partial queries with partial words (search type-ahead) which
+-- are useless for measuring fulltext query performance.
+-- To avoid including too many automated (bot) queries we accept only
+-- one query per IP per day.
+
+-- Random sample of 1000 queries that return more than 500 results
+-- on enwiki over one week
+
+SET year=2016;
+SET month=1;
+SET day_min=25;
+SET day_max=31;
+SET min_res=500;
+SET wiki='enwiki';
+SET index='enwiki_content';
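+-- To sample another wiki, adjust both wiki and index together
+-- (illustrative, assuming the usual <dbname>_content index naming):
+-- SET wiki='dewiki'; SET index='dewiki_content';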
+-- multi_word: at least two non-space tokens separated by whitespace
+SET multi_word_regex='\\S\\s+\\S';
+-- single_word: a single token with no whitespace
+SET single_word_regex='^\\S+$';
+SET query_regex=${hiveconf:multi_word_regex};
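+-- To sample single-word queries instead, point query_regex at the
+-- single-word pattern defined above (illustrative, not enabled here):
+-- SET query_regex=${hiveconf:single_word_regex};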
+
+SELECT DISTINCT q FROM (
+       SELECT
+               -- keep only one query at random per ip/day
+               FIRST_VALUE(areq.query) OVER (
+                       PARTITION BY csr.ip, csr.day
+                       ORDER BY rand()
+               ) AS q
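+               -- (ORDER BY rand() shuffles rows within each ip/day
+               -- partition, so FIRST_VALUE picks one query at random)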
+       FROM
+               CirrusSearchRequestSet csr
+               -- Explode the requests array so we can extract the
+               -- last full_text query
+               LATERAL VIEW EXPLODE(requests) req AS areq
+       WHERE
+               year = ${hiveconf:year} AND month = ${hiveconf:month}
+               AND day >= ${hiveconf:day_min} AND day <= ${hiveconf:day_max}
+
+               -- When the user hits enter it generates a near_match
+               -- query first.
+               AND csr.requests[0].queryType = 'near_match'
+
+               -- Keep only full_text queries with more than min_res results
+               AND areq.queryType = 'full_text'
+               AND areq.hitstotal > ${hiveconf:min_res}
+
+               -- Make sure we extract only the target index
+               AND size(areq.indices) = 1
+               AND areq.indices[0] = ${hiveconf:index}
+               AND wikiid = ${hiveconf:wiki}
+
+               AND areq.query RLIKE ${hiveconf:query_regex}
+
+               -- TODO: make sure we don't get a did you mean
+               -- rewritten query.
+) queries
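+-- DISTRIBUTE BY rand() spreads rows randomly across reducers and
+-- SORT BY rand() shuffles within each reducer, so LIMIT 1000 takes an
+-- approximately uniform random sample without funneling everything
+-- through a single ORDER BY reducer.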
+-- Random sampling technique borrowed from https://www.joefkelley.com/?p=736
+DISTRIBUTE BY rand()
+SORT BY rand()
+LIMIT 1000;
+

-- 
To view, visit https://gerrit.wikimedia.org/r/268704
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ice89a203186e2e3c35e0108588c18fcacb1cfbc4
Gerrit-PatchSet: 1
Gerrit-Project: wikimedia/discovery/relevancylab
Gerrit-Branch: master
Gerrit-Owner: DCausse <dcau...@wikimedia.org>
