OliverKeyes has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/258197

Change subject: Add spider detection to WDQS data collection code
......................................................................

Add spider detection to WDQS data collection code

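Flag requests whose agent_type is 'spider' in a new is_automata column,
and write the aggregates to wdqs_aggregates_new.tsv.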

Bug: T121106
Change-Id: If36023abff3486532a6989157aff94ca1c075b0b
---
M wdqs/basic_usage.R
1 file changed, 5 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/wikimedia/discovery/golden refs/changes/97/258197/1

diff --git a/wdqs/basic_usage.R b/wdqs/basic_usage.R
index 41b2988..cd56174 100644
--- a/wdqs/basic_usage.R
+++ b/wdqs/basic_usage.R
@@ -21,23 +21,25 @@
   query <- paste0("USE wmf;
                    SELECT year, month, day, uri_path,
                    UPPER(http_status IN('200','304')) as success,
+                   CASE WHEN agent_type = 'spider' THEN 'TRUE' ELSE 'FALSE' END AS is_automata,
                    COUNT(*) AS n
                    FROM webrequest",
                    subquery,
                   "AND webrequest_source = 'misc'
                    AND uri_host = 'query.wikidata.org'
                    AND uri_path IN('/', '/bigdata/namespace/wdq/sparql')
-                   GROUP BY year, month, day, uri_path,
-                   UPPER(http_status IN('200','304'));")
+                   GROUP BY year, month, day, uri_path, UPPER(http_status IN('200','304')),
+                   CASE WHEN agent_type = 'spider' THEN 'TRUE' ELSE 'FALSE' END;")
   results <- query_hive(query)
 
   output <- data.frame(date = as.Date(paste(results$year, results$month, results$day, sep = "-")),
                        path = results$uri_path,
                        http_success = results$success,
+                       is_automata = results$is_automata,
                        events = results$n,
                        stringsAsFactors = FALSE)
 
   # Write out
-  conditional_write(output, file.path(base_path, "wdqs_aggregates.tsv"))
+  conditional_write(output, file.path(base_path, "wdqs_aggregates_new.tsv"))
 
 }
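
The new column's logic can be tried outside Hive with a minimal R sketch on made-up rows. The sample data and the flag_automata helper are hypothetical and not part of this change; the sketch only mirrors the CASE WHEN expression and the GROUP BY from the query in the diff, assuming a NULL agent_type falls through to 'FALSE' as it would in the CASE.

```r
# Hypothetical sample of webrequest rows; all values are invented for illustration.
requests <- data.frame(
  uri_path    = c("/", "/", "/",
                  "/bigdata/namespace/wdq/sparql",
                  "/bigdata/namespace/wdq/sparql"),
  http_status = c("200", "200", "404", "200", "304"),
  agent_type  = c("user", "user", "spider", "spider", "user"),
  stringsAsFactors = FALSE
)

# Hypothetical helper mirroring:
#   CASE WHEN agent_type = 'spider' THEN 'TRUE' ELSE 'FALSE' END
# Missing values map to "FALSE", like the ELSE branch of the CASE.
flag_automata <- function(agent_type) {
  ifelse(!is.na(agent_type) & agent_type == "spider", "TRUE", "FALSE")
}

# Mirrors UPPER(http_status IN('200','304')) as a "TRUE"/"FALSE" string.
requests$success     <- toupper(as.character(requests$http_status %in% c("200", "304")))
requests$is_automata <- flag_automata(requests$agent_type)
requests$n           <- 1L

# Mirrors the GROUP BY: one row per (uri_path, success, is_automata) combination,
# with n holding the number of requests in that group.
aggregates <- aggregate(n ~ uri_path + success + is_automata,
                        data = requests, FUN = sum)
print(aggregates)
```

Because is_automata is part of the grouping, each existing (path, success) row splits into at most two rows, one for spiders and one for everything else.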

-- 
To view, visit https://gerrit.wikimedia.org/r/258197
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: If36023abff3486532a6989157aff94ca1c075b0b
Gerrit-PatchSet: 1
Gerrit-Project: wikimedia/discovery/golden
Gerrit-Branch: master
Gerrit-Owner: OliverKeyes <[email protected]>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
