[Wikidata-bugs] [Maniphest] T299358: dumps.wikimedia.org access logs on stat1007 are incomplete since May 2021 (possibly earlier)

2022-01-26 Thread Lucas_Werkmeister_WMDE
Lucas_Werkmeister_WMDE closed this task as "Resolved".
Lucas_Werkmeister_WMDE moved this task from Incoming to Done on the Wikidata 
Analytics board.
Lucas_Werkmeister_WMDE claimed this task.
Lucas_Werkmeister_WMDE added a comment.


  The file size also went back up starting from the same date:
  
lucaswerkmeister-wmde@stat1007:~$ ls -lh /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-202112*.gz | tail -15
-rw-r--r-- 1 root root  75K Dec 16 23:59 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211217.gz
-rw-r--r-- 1 root root  50K Dec 17 23:59 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211218.gz
-rw-r--r-- 1 root root  59K Dec 18 23:59 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211219.gz
-rw-r--r-- 1 root root  55K Dec 19 23:59 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211220.gz
-rw-r--r-- 1 root root 6.1M Dec 21 00:00 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211221.gz
-rw-r--r-- 1 root root 7.2M Dec 22 00:00 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211222.gz
-rw-r--r-- 1 root root 3.2M Dec 23 00:00 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211223.gz
-rw-r--r-- 1 root root 3.9M Dec 24 00:00 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211224.gz
-rw-r--r-- 1 root root 3.5M Dec 25 00:00 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211225.gz
-rw-r--r-- 1 root root 3.7M Dec 26 00:00 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211226.gz
-rw-r--r-- 1 root root 7.1M Dec 27 00:00 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211227.gz
-rw-r--r-- 1 root root 4.8M Dec 28 00:00 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211228.gz
-rw-r--r-- 1 root root 4.2M Dec 29 00:00 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211229.gz
-rw-r--r-- 1 root root 3.6M Dec 30 00:00 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211230.gz
-rw-r--r-- 1 root root 3.3M Dec 31 00:00 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211231.gz
  
  I think we can call this resolved: we have the proper access logs again. It 
looks like a totally different source (not affected by the Puppet code we 
fixed) is also syncing over some access logs, which now get overwritten again, 
but right now I don’t really want to go hunting for that.

TASK DETAIL
  https://phabricator.wikimedia.org/T299358

WORKBOARD
  https://phabricator.wikimedia.org/project/board/5408/

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Lucas_Werkmeister_WMDE
Cc: BTullis, Manuel, Michael, Lucas_Werkmeister_WMDE, Aklapper, EChetty, 
Invadibot, maantietaja, Akuckartz, 4748kitoko, Nandana, Akovalyov, Lahi, Gq86, 
GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, 
JAllemandou, terrrydactyl, Wikidata-bugs, aude, Mbch331, jeremyb
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org



2022-01-26 Thread Lucas_Werkmeister_WMDE
Lucas_Werkmeister_WMDE added a comment.


  That seems to have fixed the logs; using the same `zgrep` pipeline from the 
task description (now teed into 
`stat1007:~lucaswerkmeister-wmde/wikidatawiki-patterns-accesses-2022-01-26` if 
anyone wants to look at the full output):
  
   1629 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20210501.gz
   2784 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20210503.gz
   8792 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20210504.gz
   7214 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20210505.gz
      6 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211029.gz
      1 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211123.gz
   1570 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211221.gz
   1437 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211222.gz
   2471 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211223.gz
 473952 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211224.gz
   1802 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211225.gz
   5830 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211226.gz
   1668 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211227.gz
 113870 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211228.gz
  84987 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211229.gz
   1645 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211230.gz
   1561 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211231.gz
  
  Looks like a bit over a month’s worth of logs were still available for 
rsyncing over (since results resume 2021-12-21); all the logs in the meantime 
are presumably lost by now.


2022-01-25 Thread Maintenance_bot
Maintenance_bot removed a project: Patch-For-Review.


2022-01-25 Thread gerritbot
gerritbot added a comment.


  Change 755352 **merged** by RLazarus:
  
  [operations/puppet@production] nginxlogs: Move rsync globs to --include/--exclude
  
  https://gerrit.wikimedia.org/r/755352
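
  The patch title above names a general rsync technique. As a rough sketch only (this is not the merged Puppet code, and the paths and the `build_rsync_args` helper are hypothetical), the idea is to pass the source directory itself and let rsync's own filter rules pick the files, so that nothing depends on a shell expanding a wildcard in the source argument:

```shell
#!/bin/sh
# Sketch: print (without running) an rsync argument list that selects files
# via rsync filter rules instead of a wildcard in the source path.
# All paths used here are hypothetical placeholders.
build_rsync_args() {
    src_dir=$1    # directory holding the logs on the sending side
    dest_dir=$2   # directory to copy into
    pattern=$3    # file name pattern, matched by rsync itself
    # --include keeps matching files; --exclude='*' drops everything else,
    # including subdirectories, so only top-level matches are transferred.
    printf '%s\n' rsync -rt "--include=$pattern" '--exclude=*' "$src_dir/" "$dest_dir/"
}

# Example: one argument per line, so the literal (unexpanded) pattern is visible.
build_rsync_args /tmp/example-src /tmp/example-dest 'access.log-*.gz'
```

  Because the pattern stays a single literal argument, the command should behave the same whether or not it is launched through a shell.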


2022-01-19 Thread gerritbot
gerritbot added a comment.


  Change 755352 had a related patch set uploaded (by Lucas Werkmeister (WMDE); 
author: Lucas Werkmeister (WMDE)):
  
  [operations/puppet@production] nginxlogs: Use shell to expand glob
  
  https://gerrit.wikimedia.org/r/755352


2022-01-19 Thread gerritbot
gerritbot added a comment.


  Change 755347 **merged** by Btullis:
  
  [operations/puppet@production] Add single quotes to the wildcard for rsyncing 
nginxlogs
  
  https://gerrit.wikimedia.org/r/755347


2022-01-19 Thread gerritbot
gerritbot added a project: Patch-For-Review.


2022-01-19 Thread gerritbot
gerritbot added a comment.


  Change 755347 had a related patch set uploaded (by Btullis; author: Btullis):
  
  [operations/puppet@production] Add single quotes to the wildcard for rsyncing 
nginxlogs
  
  https://gerrit.wikimedia.org/r/755347


2022-01-17 Thread Lucas_Werkmeister_WMDE
Lucas_Werkmeister_WMDE added a comment.


  @Michael speculates that there might be multiple servers rsyncing their 
access logs to stat1007 and overwriting each other’s data; as far as I can tell 
in Puppet, only one “primary server” is supposed to copy its logs, but apart 
from that it certainly feels like a plausible explanation to me. 
Such a race condition between multiple servers might explain why it also 
sometimes happened prior to 2021-05-05.
  
  (By the way, I looked through the SAL and puppet.git logs around 2021-05-05, 
but nothing jumped out at me as looking significant.)


2022-01-17 Thread Lucas_Werkmeister_WMDE
Lucas_Werkmeister_WMDE added a parent task: T292621: Fix stats for Wikidata 
dump downloads dashboard.


2022-01-17 Thread Lucas_Werkmeister_WMDE
Lucas_Werkmeister_WMDE added a comment.


  The requests prior to 2021-05-05 came from a variety of user agents, so I’m 
willing to rule out the possibility that there were genuinely fewer accesses 
after that date; I think it’s okay to post the top user agents (none of them 
look very private):
  
$ zgrep -E \
    -e '(latest|wikidata-[0-9]{8})-all\.json\.(gz|bz2)' \
    -e '(latest|wikidata-[0-9]{8})-all-BETA\.ttl\.(gz|bz2)' \
    -e 'wikidatawiki-(latest|[0-9]{8})-(pages-articles-multistream|pages-meta-history|pages-meta-current|pages-articles)1?\.xml\.(gz|bz2)' \
    -e 'wikidatawiki-[0-9]{8}-pages-meta-hist-incr.xml\.(gz|bz2)' \
    /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20210[45]*.gz \
    | awk '$9 == "200" || $9 == "206"' \
    | awk -F'"' '{ print $6 }' \
    | sort | uniq -c | sort -rn | head -25
  11688 Go-http-client/1.1
   6817 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
   2565 Mozilla/5.0 (compatible; YandexOntoDB/1.0; +http://yandex.com/bots)
   1259 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36
    980 Apache-HttpClient/4.5.9 (Java/1.8.0_265)
    726 Apache-HttpClient/4.5.2 (Java/1.8.0_262)
    707 curl/7.29.0
    560 aria2/1.34.0
    493 Wget/1.19.4 (linux-gnu)
    421 python-requests/2.23.0
    387 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36
    329 Wget/1.21.1
    136 python-requests/2.25.1
    129 Wget/1.20.3 (linux-gnu)
    122 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1) QQBrowser/6.0
    119 -
    109 Mozilla/5.0 (compatible;AspiegelBot)
    106 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36
     85 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36
     83 Wget/1.14 (linux-gnu)
     72 Wget/1.20.1 (linux-gnu)
     70 aria2/1.35.0
     65 Python-urllib/3.7
     62 python-requests/2.21.0
     62 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36
  
  I also did a download myself, with a custom user agent, just to be sure:
  
$ curl -H 'User-Agent: Lucas-Werkmeister (really wants to see this request in the access logs lucas.werkmeis...@wikimedia.de) curl/7.74.0' \
    https://dumps.wikimedia.org/wikidatawiki/entities/20220110/wikidata-20220110-all.json.bz2 \
    > /dev/null
  
  That was on Friday evening, and there’s no trace of it in the access logs:
  
$ zgrep -c Lucas /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-2022011{4..6}.gz
/srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20220114.gz:0
/srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20220115.gz:0
/srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20220116.gz:0
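
  The two `awk` stages in the user-agent pipeline above split each log line in two different ways. A minimal sketch of how that works, assuming the logs use the common "combined" access log format (an assumption; the thread does not state the format). The sample lines and the helper names `user_agents_of_successful_requests` and `sample_log` are made up for illustration:

```shell
#!/bin/sh
# Sketch: keep only 200 (OK) and 206 (Partial Content) responses, then print
# the user agent. Assumes the "combined" access log format.
# Stage 1 splits on whitespace, where field 9 is the HTTP status code.
# Stage 2 splits on double quotes, where field 6 is the User-Agent string.
user_agents_of_successful_requests() {
    awk '$9 == "200" || $9 == "206"' | awk -F'"' '{ print $6 }'
}

# Made-up sample lines (documentation IP addresses, invented sizes).
sample_log() {
    cat <<'EOF'
192.0.2.1 - - [01/May/2021:00:00:01 +0000] "GET /wikidatawiki/entities/latest-all.json.bz2 HTTP/1.1" 200 1234 "-" "curl/7.29.0"
192.0.2.2 - - [01/May/2021:00:00:02 +0000] "GET /wikidatawiki/entities/latest-all.json.gz HTTP/1.1" 404 0 "-" "Wget/1.21.1"
192.0.2.3 - - [01/May/2021:00:00:03 +0000] "GET /wikidatawiki/entities/latest-all.json.gz HTTP/1.1" 206 99 "-" "aria2/1.34.0"
EOF
}

# The 404 line is dropped; the user agents of the other two lines are printed.
sample_log | user_agents_of_successful_requests
```

  Counting 206 as well as 200 matters for large dump files, since download tools often fetch them with range requests.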


2022-01-17 Thread Lucas_Werkmeister_WMDE
Lucas_Werkmeister_WMDE created this task.
Lucas_Werkmeister_WMDE added projects: Wikidata, Wikidata Analytics, Analytics.
Restricted Application added a subscriber: Aklapper.

TASK DESCRIPTION
  The `/srv/log/webrequest/archive/dumps.wikimedia.org/` directory on stat1007 
contains gzipped access logs for dumps.wikimedia.org, one per day. The wmde 
analytics script dumpDownloads.php analyzes those files looking for accesses to 
certain files (Wikidata dumps), but since May 2021, there is barely anything to 
find there:
  
$ zgrep -F /wikidatawiki/ /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-2021*.gz \
    | grep -E \
        -e '(latest|wikidata-[0-9]{8})-all\.json\.(gz|bz2)' \
        -e '(latest|wikidata-[0-9]{8})-all-BETA\.ttl\.(gz|bz2)' \
        -e 'wikidatawiki-(latest|[0-9]{8})-(pages-articles-multistream|pages-meta-history|pages-meta-current|pages-articles)1?\.xml\.(gz|bz2)' \
        -e 'wikidatawiki-[0-9]{8}-pages-meta-hist-incr.xml\.(gz|bz2)' \
    | cut -d: -f1 \
    | uniq -c \
    | tee ~lucaswerkmeister-wmde/wikidatawiki-patterns-accesses
# snip many results, up to 700k matches per file, up to and including April 2021
   1629 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20210501.gz
   2784 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20210503.gz
   8792 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20210504.gz
   7214 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20210505.gz
      6 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211029.gz
      1 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211123.gz
      1 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20211231.gz
  
  5 May 2021 is the last day with any significant number of results. The file 
size of the access logs also dropped sharply after that point:
  
lucaswerkmeister-wmde@stat1007:~$ ls -lh /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-202105*.gz | head
-rw-r--r-- 1 root root 6.6M May  1  2021 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20210501.gz
-rw-r--r-- 1 root root 128K May  1  2021 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20210502.gz
-rw-r--r-- 1 root root 9.6M May  3  2021 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20210503.gz
-rw-r--r-- 1 root root 6.1M May  4  2021 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20210504.gz
-rw-r--r-- 1 root root 6.1M May  5  2021 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20210505.gz
-rw-r--r-- 1 root root 139K May  6  2021 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20210506.gz
-rw-r--r-- 1 root root 145K May  6  2021 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20210507.gz
-rw-r--r-- 1 root root 125K May  7  2021 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20210508.gz
-rw-r--r-- 1 root root 126K May  8  2021 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20210509.gz
-rw-r--r-- 1 root root 134K May  9  2021 /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20210510.gz
  
  However, while the problem seems to have gotten much worse after 5 May 2021, 
it didn’t start then; for instance, observe `access.log-20210502.gz` above, 
which was already smaller and also had no grep results. In April 2021, files 
for 11 days contain matches (specifically: 04, 05, 07, 09, 11, 14, 15, 17, 21, 
25, 28), and those files are again larger than the other ones (ca. 6-10 MB 
rather than 80-125 kB).
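
  Since the incomplete files stand out by size alone, a size check can flag suspect days without grepping their contents. A minimal sketch, assuming a byte threshold somewhere between the two size ranges observed above (the 1 MiB value in the usage note is a guess, and the `list_small_logs` helper is hypothetical):

```shell
#!/bin/sh
# Sketch: list access logs in a directory that are smaller than a threshold.
# Prints "<size in bytes><TAB><path>" for each file below the threshold.
list_small_logs() {
    dir=$1         # directory containing access.log-YYYYMMDD.gz files
    min_bytes=$2   # files smaller than this are reported
    for f in "$dir"/access.log-*.gz; do
        # Skip the unexpanded pattern when no file matches.
        [ -e "$f" ] || continue
        # Arithmetic expansion strips any padding that wc adds to its output.
        size=$(($(wc -c < "$f")))
        if [ "$size" -lt "$min_bytes" ]; then
            printf '%s\t%s\n' "$size" "$f"
        fi
    done
}
```

  On stat1007 this could be called as `list_small_logs /srv/log/webrequest/archive/dumps.wikimedia.org 1048576`; a day with genuinely low traffic would also be flagged, so the output is only a list of candidates to inspect.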
