Re: [Analytics] [Release] explore the (parsed) (common) Wikimedia user agents

2015-03-09 Thread Oliver Keyes
1%...of the browsers that made it through the minimum request count filter ;). But crawler-traffic overall is actually ~50% of US desktop traffic, for scale. We get a lot of hits (not so much from Google, who crawl in a smart way, as Bing, who crawl in a very dumb way) On 9 March 2015 at 22:54, Ti

Re: [Analytics] [Release] explore the (parsed) (common) Wikimedia user agents

2015-03-09 Thread Timo Tijhof
Wow, does Googlebot really represent over 1% of our desktop/reader traffic? Rather interesting compared to that of e.g. WinXP/IE6, which is over 60x smaller at 0.016%. But never mind IE6's percentage, that of Google would seem quite high. — Timo On 6 Mar 2015, at 01:02, Oliver Keyes wrote: >

Re: [Analytics] Anomalies in pagecounts files?

2015-03-09 Thread Oliver Keyes
What do you mean by help? Provide assistance in building the replacement systems we're building? On 9 March 2015 at 18:49, Roni Wiener wrote: > > Thanks for the info, both your points can explain the anomalies I saw. > > The mirroring issue can explain the reason why I see many *.mp3 and .*_ep

[Analytics] Eventlogging backfilling for outage 02/04 02/10 done.

2015-03-09 Thread Nuria Ruiz
Team: Eventlogging backfilling for outage 02/04 to 02/10 is done. Some events were filled from raw logs, some from processed logs. Because most of the "droppage" happened intermittently, the backfilling just re-ran the events from 02/04 to 02/10 one by one. Here are the descriptions of the two in

Re: [Analytics] [Cluster] Monitoring the impact Hive jobs have on the Analytics cluster

2015-03-09 Thread Christian Aistleitner
Hi Andrew, On Mon, Mar 09, 2015 at 11:54:56AM -0400, Andrew Otto wrote: > > https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Load > Christian, may I move this page into the Cluster/Hadoop/Administration page? I think a separate page is worth it as the target audience is different from

Re: [Analytics] Anomalies in pagecounts files?

2015-03-09 Thread Roni Wiener
Thanks for the info, both your points can explain the anomalies I saw. The mirroring issue can explain the reason why I see many *.mp3 and .*_ep titles in the pagecounts files that do not correlate to any Wikipedia page, probably spammers monetizing music. How can I help resolving these issue

Re: [Analytics] [Technical] missing dialect subdomains in the new pageviews definition

2015-03-09 Thread Dario Taraborelli
thanks, Oliver (and James for spotting this). > On Mar 9, 2015, at 2:30 PM, Oliver Keyes wrote: > > Now logged in Phabricator at https://phabricator.wikimedia.org/T92020 > > On 9 March 2015 at 16:24, Oliver Keyes wrote: >> Bah; folder names, rather than subdomains. >> >> On 9 March 2015 at 16

Re: [Analytics] [Technical] missing dialect subdomains in the new pageviews definition

2015-03-09 Thread Oliver Keyes
Now logged in Phabricator at https://phabricator.wikimedia.org/T92020 On 9 March 2015 at 16:24, Oliver Keyes wrote: > Bah; folder names, rather than subdomains. > > On 9 March 2015 at 16:24, Oliver Keyes wrote: >> Hey all, >> >> One of the big improvements of the new definition over the old one

Re: [Analytics] Provenance Params

2015-03-09 Thread Adam Baso
Okay, we'll plan on wprov. On Wed, Mar 4, 2015 at 12:44 PM, Dan Garry wrote: > Works for me. > > Dan > > On 4 March 2015 at 12:33, Adam Baso wrote: > >> How about 'wprov'? >> >> On Wed, Mar 4, 2015 at 12:29 PM, Dan Garry wrote: >> >>> I'd really rather this be either something that's totally n

Re: [Analytics] [Technical] missing dialect subdomains in the new pageviews definition

2015-03-09 Thread Oliver Keyes
Bah; folder names, rather than subdomains. On 9 March 2015 at 16:24, Oliver Keyes wrote: > Hey all, > > One of the big improvements of the new definition over the old one is > that the new one is not limited to /wiki/. It includes all of the > Chinese and Serbian dialects that have their own fold

[Analytics] [Technical] missing dialect subdomains in the new pageviews definition

2015-03-09 Thread Oliver Keyes
Hey all, One of the big improvements of the new definition over the old one is that the new one is not limited to /wiki/. It includes all of the Chinese and Serbian dialects that have their own folder names and were not appearing, as a result, in the old pageview counts. James F (thanks James!) r
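
To make the folder-name point concrete, here is a minimal sketch of a path check that accepts /wiki/ as well as dialect folders. The folder list and the name is_article_path are assumptions made for illustration only; this is not the actual pageview definition.

```python
# Hypothetical sketch: the variant folder list below is illustrative,
# not the official list used by the pageview definition.
VARIANT_FOLDERS = {"zh-cn", "zh-hans", "zh-hant", "zh-hk", "zh-tw", "sr-ec", "sr-el"}


def is_article_path(path: str) -> bool:
    """Return True if a URI path looks like an article view under /wiki/
    or under one of the assumed dialect folders."""
    parts = path.split("/")
    # "/wiki/Foo" splits into ["", "wiki", "Foo"]; require a non-empty title.
    if len(parts) < 3 or parts[0] != "" or parts[2] == "":
        return False
    return parts[1] == "wiki" or parts[1] in VARIANT_FOLDERS
```

A check limited to /wiki/ alone would reject a path such as /zh-hans/Foo, which is the kind of request the message says was missing from the old counts.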

Re: [Analytics] Anomalies in pagecounts files?

2015-03-09 Thread Oliver Keyes
Well, the raw Double-entry_bookkeeping_system only has 14k views in that hour, so I have to assume that (55k-14k) views are coming from some oddly localised URI. Not sanitising input is...one of the many things we should fix. But, I would warn you that this is likely automata. Some things I have s

Re: [Analytics] [Cluster] Monitoring the impact Hive jobs have on the Analytics cluster

2015-03-09 Thread Nuria Ruiz
>Aside from this, I get daily emails about webrequest partition statuses, and I would at least notice the morning after that something is wrong. Right, but in the case of Friday that would mean perhaps having to backfill a bunch of data up to Saturday morning, whereas if we have alarms we can detec

Re: [Analytics] Anomalies in pagecounts files?

2015-03-09 Thread Oliver Keyes
It's more likely that it's just an attack by automata, rather than a sharp peak of genuine interest. Since 20150306 is within the last 30 days I can look and check, and will do so now. On 8 March 2015 at 15:18, Roni Wiener wrote: > Hi > > I was goofing around with the Wikipedia page counts dumps

Re: [Analytics] [Cluster] Monitoring the impact Hive jobs have on the Analytics cluster

2015-03-09 Thread Andrew Otto
> Should have icinga alarms around these types of issues? Seems like that > would be the way to go. Aside from this, I get daily emails about webrequest partition statuses, and I would at least notice the morning after that something is wrong. > On Mar 7, 2015, at 21:20, Nuria Ruiz wrote:

Re: [Analytics] [Cluster] Monitoring the impact Hive jobs have on the Analytics cluster

2015-03-09 Thread Andrew Otto
> https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Load Christian, may I move this page into the Cluster/Hadoop/Administration page? > Should have icinga alarms around these types of issues? Seems like that > would be the way to go. We used to have icinga alarms based on webrequest

[Analytics] Anomalies in pagecounts files?

2015-03-09 Thread Roni Wiener
Hi I was goofing around with the Wikipedia page counts dumps and noticed some strange anomalies. For example: The page "Double-entry_bookkeeping_system" had 55921 page views in pagecounts-20150306-07.gz, whereas it had only 54 views in pagecounts-20150306-10.gz (3 hours later). Is there a b
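
A comparison like the one described above could be automated roughly as follows. This is a sketch under assumptions: it takes each line of an hourly file to be four whitespace-separated fields (project, title, view count, bytes), and the names parse_counts, find_spikes and load_hour, along with the threshold values, are hypothetical choices for illustration.

```python
import gzip
from typing import Dict, Iterable, List, Tuple


def parse_counts(lines: Iterable[str], project: str = "en") -> Dict[str, int]:
    """Parse lines of the assumed form 'project title count bytes'."""
    counts: Dict[str, int] = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 4 or fields[0] != project:
            continue  # skip malformed lines and other projects
        try:
            counts[fields[1]] = counts.get(fields[1], 0) + int(fields[2])
        except ValueError:
            continue  # skip lines whose count is not an integer
    return counts


def find_spikes(
    earlier: Dict[str, int],
    later: Dict[str, int],
    ratio: float = 100.0,   # assumed threshold, tune as needed
    min_views: int = 1000,  # assumed floor to ignore low-traffic noise
) -> List[Tuple[str, int, int]]:
    """Return (title, earlier_views, later_views) for titles whose earlier
    count is at least `ratio` times the later count."""
    spikes = []
    for title, views in earlier.items():
        if views < min_views:
            continue
        later_views = later.get(title, 0)
        if views / max(later_views, 1) >= ratio:
            spikes.append((title, views, later_views))
    return sorted(spikes, key=lambda s: s[1], reverse=True)


def load_hour(path: str, project: str = "en") -> Dict[str, int]:
    """Read one gzipped hourly file into a title -> views mapping."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return parse_counts(fh, project)
```

With the two files named in the message, find_spikes(load_hour("pagecounts-20150306-07.gz"), load_hour("pagecounts-20150306-10.gz")) would be one way to list titles that drop as sharply as the 55921 versus 54 example. Whether a flagged title reflects a bug or automated traffic still needs checking against the request logs.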

Re: [Analytics] [Cluster] Monitoring the impact Hive jobs have on the Analytics cluster

2015-03-09 Thread Christian Aistleitner
Hi Pine, On Sat, Mar 07, 2015 at 08:15:18PM -0800, Pine W wrote: > Chris, may I quote your email on BASH? They take emails too? Regardless ... feel free to quote or forward any of my emails wherever you see fit. Have fun, Christian -- quelltextlich e.U. \\ Christian Aistleit

Re: [Analytics] [Cluster] Monitoring the impact Hive jobs have on the Analytics cluster

2015-03-09 Thread Joseph Allemandou
Thanks a lot Christian :) I had not meant by any means to overload the cluster last Friday ... I did it nonetheless. Your page on how to 'keep an eye on it' will really be useful! Cheers Joseph On Sun, Mar 8, 2015 at 8:26 PM, Leila Zia wrote: > This is really useful, Christian. Thanks for explai

Re: [Analytics] stats.grok.se not updating

2015-03-09 Thread Ariel T. Glenn
On 08-03-2015, Sun, at 09:54 -0700, Vipul Naik wrote: > Seems like stats.grok.se hasn't updated for the last two days again. > Will it be back to updating soon? > Henrik, if bandwidth is below what you were seeing (i.e. overloaded again), for now you could point to ms1001.wikimed