Re: [Wiki-research-l] 2012 top pageview list

2012-12-28 Thread John Vandenberg
There is a steady stream of blog posts and 'news' about these lists:

https://encrypted.google.com/search?client=ubuntu&channel=fs&q=%22Sean+hoyland%22&ie=utf-8&oe=utf-8#q=wikipedia+top+2012&hl=en&safe=off&client=ubuntu&tbo=d&channel=fs&tbm=nws&source=lnt&tbs=qdr:w&sa=X&psj=1&ei=GzjeUOPpAsfnrAeQk4DgCg&ved=0CB4QpwUoAw&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.&bvm=bv.1355534169,d.aWM&fp=4e60e761ee133369&bpcl=40096503&biw=1024&bih=539

How does a researcher go about obtaining access logs with useragents
in order to answer some of these questions?

-- 
John Vandenberg

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] 2012 top pageview list

2012-12-28 Thread Tilman Bayer
On Fri, Dec 28, 2012 at 10:24 AM, John Vandenberg wrote:

> Is favicon only in the Chinese Wikipedia top 100?
>
> It seems so, and is odd if the problem is a web browser bug.
>
> John Vandenberg.
> sent from Galaxy Note
> On Dec 28, 2012 4:07 PM, "Johan Gunnarsson" wrote:
>
>> On Fri, Dec 28, 2012 at 5:33 AM, John Vandenberg wrote:
>> > Hi Johan,
>> >
>> > Thank you for the lovely data at
>> >
>> > https://toolserver.org/~johang/2012.html
>> >
>> > I posted that link to my facebook (below if you want to join in
>> > there), and a few language specific facebook groups, and there have
>> > been some concerns raised about the results, which I'll list below.
>> >
>> > These lists are getting some traction in the press so it would be good
>> > to understand it better.
>> >
>> > http://guardian.co.uk/technology/blog/2012/dec/27/wikipedia-most-viewed
>>
>> Cool, cool.
>>
>>
>> >
>> > Why is [[zh:Favicon]] #2?
>> >
>> > The data doesn't appear to support that
>> >
>> > http://stats.grok.se/zh/201201/Favicon
>> > http://stats.grok.se/zh/latest90/Favicon
>>
>> My post-processing filtering follows redirects to find the "true"
>> title. In this case the page Favicon.ico redirects to Favicon. This is
>> probably due to broken browsers trying to load the icon.
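
A minimal sketch of the redirect-following step described above, assuming the raw per-title counts and the redirect table are already loaded as plain dictionaries. The function names and the numbers are hypothetical and for illustration only; this is not Johan's actual filtering code.

```python
from collections import Counter


def resolve_title(title, redirects, max_hops=10):
    """Follow redirects until a non-redirect title is reached.

    Stops on a redirect loop or after max_hops to avoid spinning forever.
    """
    seen = set()
    while title in redirects and title not in seen and len(seen) < max_hops:
        seen.add(title)
        title = redirects[title]
    return title


def aggregate_views(raw_counts, redirects):
    """Sum the views of every title into its resolved ("true") title."""
    totals = Counter()
    for title, views in raw_counts.items():
        totals[resolve_title(title, redirects)] += views
    return totals


# Hypothetical numbers, chosen only to show the effect.
redirects = {"Favicon.ico": "Favicon"}
raw_counts = {"Favicon.ico": 900, "Favicon": 100, "Hua_Shan": 50}
totals = aggregate_views(raw_counts, redirects)
print(totals.most_common())
```

Under these assumptions the aggregated total for Favicon is much higher than the count for the title Favicon alone, which is one way a per-title statistics page and an aggregated list could disagree.
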
>>
>>
>> >
>> > Number 1 in French is a plant native to Asia.  The stats for December
>> > disagree
>> > https://en.wikipedia.org/wiki/Ilex_crenata
>> > http://stats.grok.se/fr/201212/Houx_cr%C3%A9nel%C3%A9
>>
>> On the French Wikipedia, Ilex_crenata redirects to Houx_crénelé.
>>
>> Ilex_crenata had huge traffic in April:
>> http://stats.grok.se/fr/201204/Ilex_crenata
>>
>> There are a bunch of spikes like this. I can't really explain it. I
>> talked to Domas Mituzas (the maintainer of the original dumps I use)
>> yesterday and he suggested it might be bots going crazy for whatever
>> reason. I'd love to filter all these false positives, but haven't been
>> able to come up with an easy way to do it.
>>
>> Might be possible with access to logs with the user-agent string, but
>> that would probably inflate the dataset size even more. It's already
>> past the terabyte. However that could probably be solved by sampling
>> (for example) 1/100 of the entries.
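
A rough sketch of the 1/100 sampling idea, assuming the access log can be read as an iterable of lines. The function names and the `one_in` parameter are hypothetical, and the log line format is deliberately left unspecified.

```python
import random


def sample_lines(lines, one_in=100, seed=None):
    """Yield each line independently with probability 1/one_in."""
    rng = random.Random(seed)
    for line in lines:
        if rng.randrange(one_in) == 0:
            yield line


def estimate_total(sample_count, one_in=100):
    """Scale a count observed in the sample back up to the full log."""
    return sample_count * one_in


# Synthetic stand-in for a log file; real entries would carry the
# requested title and the user-agent string.
fake_log = ("request %d" % i for i in range(100000))
kept = sum(1 for _ in sample_lines(fake_log, one_in=100, seed=0))
print(kept, estimate_total(kept))
```

Sampling keeps the stored volume near 1% of the original, at the cost of noisy estimates for rarely viewed pages.
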
>>
>> Comments and ideas are welcome!
>>
>>
>> >
>> > Number 1 in German is Cul de sac. This is odd, but matches the stats
>> > http://stats.grok.se/de/201207/Sackgasse
>>
>> Right. This one is funny. It has huge traffic on weekdays only.
>> Deserted on weekends.
>
This has been noted on the dewiki village pump before. The most
interesting guess there (by Benutzer:YMS): There might be web filtering
software installed on workplace PCs in companies which redirects all
prohibited URLs to the German Wikipedia article on cul-de-sac. This would
explain the weekly pattern, and also
http://stats.grok.se/de/201112/Sackgasse (December 25-26 are holidays
in Germany, and many employees take the rest of the year off).
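
One way to flag pages with this kind of workday-only pattern is to compare mean weekday views with mean weekend views. The following is only a sketch under the assumption that daily counts are available as a mapping from date to views; the function name and the numbers are hypothetical, not real Sackgasse data.

```python
from datetime import date


def weekday_weekend_ratio(daily_views):
    """Return mean weekday views divided by mean weekend views.

    daily_views maps datetime.date -> view count. Returns None if either
    group is empty, and infinity if the weekend mean is zero.
    """
    weekday = [v for d, v in daily_views.items() if d.weekday() < 5]
    weekend = [v for d, v in daily_views.items() if d.weekday() >= 5]
    if not weekday or not weekend:
        return None
    weekday_mean = sum(weekday) / len(weekday)
    weekend_mean = sum(weekend) / len(weekend)
    if weekend_mean == 0:
        return float("inf")
    return weekday_mean / weekend_mean


# Hypothetical week, Monday 2012-07-02 through Sunday 2012-07-08.
views = {date(2012, 7, day): 10000 for day in range(2, 7)}
views[date(2012, 7, 7)] = 100
views[date(2012, 7, 8)] = 100
ratio = weekday_weekend_ratio(views)
print(ratio)
```

A very large ratio would be consistent with automated workplace traffic, though it cannot by itself confirm the filtering-software guess.
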


>
>>
>> >
>> > Number 1 in Dutch is a Chinese mountain.  The stats for December
>> > disagree
>> > http://stats.grok.se/nl/201212/Hua_Shan
>>
>> July/August agree: http://stats.grok.se/nl/201208/Hua_Shan
>>
>>
>> >
>> > Number 4 in Hebrew is zipper.  The stats for December disagree
>> > http://stats.grok.se/he/201212/%D7%A8%D7%95%D7%9B%D7%A1%D7%9F
>>
>> April agrees:
>> http://stats.grok.se/he/201204/%D7%A8%D7%95%D7%9B%D7%A1%D7%9F
>>
>>
>> >
>> > Number 2 in Spanish is '@'.  This is odd, but matches the stats
>> > http://stats.grok.se/es/201212/Arroba_%28s%C3%ADmbolo%29
>> >
>> > --
>> > John Vandenberg
>> > https://www.facebook.com/johnmark.vandenberg
>>
>


-- 
Tilman Bayer
Senior Operations Analyst (Movement Communications)
Wikimedia Foundation
IRC (Freenode): HaeB


Re: [Wiki-research-l] 2012 top pageview list

2012-12-28 Thread John Vandenberg
Is favicon only in the Chinese Wikipedia top 100?

It seems so, and is odd if the problem is a web browser bug.

John Vandenberg.
sent from Galaxy Note
On Dec 28, 2012 4:07 PM, "Johan Gunnarsson" wrote:

> On Fri, Dec 28, 2012 at 5:33 AM, John Vandenberg wrote:
> > Hi Johan,
> >
> > Thank you for the lovely data at
> >
> > https://toolserver.org/~johang/2012.html
> >
> > I posted that link to my facebook (below if you want to join in
> > there), and a few language specific facebook groups, and there have
> > been some concerns raised about the results, which I'll list below.
> >
> > These lists are getting some traction in the press so it would be good
> > to understand it better.
> >
> > http://guardian.co.uk/technology/blog/2012/dec/27/wikipedia-most-viewed
>
> Cool, cool.
>
> >
> > Why is [[zh:Favicon]] #2?
> >
> > The data doesn't appear to support that
> >
> > http://stats.grok.se/zh/201201/Favicon
> > http://stats.grok.se/zh/latest90/Favicon
>
> My post-processing filtering follows redirects to find the "true"
> title. In this case the page Favicon.ico redirects to Favicon. This is
> probably due to broken browsers trying to load the icon.
>
> >
> > Number 1 in French is a plant native to Asia.  The stats for December
> > disagree
> > https://en.wikipedia.org/wiki/Ilex_crenata
> > http://stats.grok.se/fr/201212/Houx_cr%C3%A9nel%C3%A9
>
> On the French Wikipedia, Ilex_crenata redirects to Houx_crénelé.
>
> Ilex_crenata had huge traffic in April:
> http://stats.grok.se/fr/201204/Ilex_crenata
>
> There are a bunch of spikes like this. I can't really explain it. I
> talked to Domas Mituzas (the maintainer of the original dumps I use)
> yesterday and he suggested it might be bots going crazy for whatever
> reason. I'd love to filter all these false positives, but haven't been
> able to come up with an easy way to do it.
>
> Might be possible with access to logs with the user-agent string, but
> that would probably inflate the dataset size even more. It's already
> past the terabyte. However that could probably be solved by sampling
> (for example) 1/100 of the entries.
>
> Comments and ideas are welcome!
>
> >
> > Number 1 in German is Cul de sac. This is odd, but matches the stats
> > http://stats.grok.se/de/201207/Sackgasse
>
> Right. This one is funny. It has huge traffic on weekdays only.
> Deserted on weekends.
>
> >
> > Number 1 in Dutch is a Chinese mountain.  The stats for December disagree
> > http://stats.grok.se/nl/201212/Hua_Shan
>
> July/August agree: http://stats.grok.se/nl/201208/Hua_Shan
>
> >
> > Number 4 in Hebrew is zipper.  The stats for December disagree
> > http://stats.grok.se/he/201212/%D7%A8%D7%95%D7%9B%D7%A1%D7%9F
>
> April agrees:
> http://stats.grok.se/he/201204/%D7%A8%D7%95%D7%9B%D7%A1%D7%9F
>
> >
> > Number 2 in Spanish is '@'.  This is odd, but matches the stats
> > http://stats.grok.se/es/201212/Arroba_%28s%C3%ADmbolo%29
> >
> > --
> > John Vandenberg
> > https://www.facebook.com/johnmark.vandenberg
>