Re: [Wiki-research-l] Identifying Wikipedia stubs in various languages

2016-09-20 Thread Andrew Gray
Hi all,

I'd strongly caution against using the stub categories without *also*
doing some kind of filtering on size. There's a real problem with
"stub lag" - articles get tagged and incrementally improve, but no-one
thinks they've done enough to justify removing the tag (or notices the
tag is there, or thinks they're allowed to remove it)... and you end up
with a lot of multi-section pages with a good hundred words of text
still labelled "stub".

(Talkpage ratings are even worse for this, but that's another issue.)

Andrew.

On 20 September 2016 at 18:01, Morten Wang  wrote:
> I don't know of a clean, language-independent way of grabbing all stubs.
> Stuart's suggestion is quite sensible, at least for English Wikipedia. When
> I last checked a few years ago, the mean length of an English-language stub
> (on a log scale) was around 1kB (including all markup), and stubs were much
> smaller than any other class.
>
> I'd also see if the category system allows for some straightforward
> retrieval. English has
> https://en.wikipedia.org/wiki/Category:Stub_categories and
> https://en.wikipedia.org/wiki/Category:Stubs with quite a lot of links to
> other languages, which could be a good starting point. For some of the
> research we've done on quality, exploiting regularities in the category
> system via database access (in other words, LIKE-queries) is a quick way
> to grab most articles - see the sketch below.
>
> A combination of both approaches might work well. If you're looking for
> even more thorough classification, grabbing a set and training a classifier
> might be the way to go.
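>
> Roughly this sort of thing is what I mean by the LIKE-query approach
> (sketch only, written from memory of the Labs replica setup - it
> assumes the usual ~/replica.my.cnf credentials and the enwiki
> convention that stub category names end in "_stubs"):
>
>     import os
>     import pymysql
>
>     # connect to the Labs replica of the enwiki database
>     conn = pymysql.connect(
>         host="enwiki.labsdb", db="enwiki_p",
>         read_default_file=os.path.expanduser("~/replica.my.cnf"))
>
>     # all mainspace pages in any category whose name ends "_stubs"
>     query = r"""
>         SELECT DISTINCT page_id, page_title
>         FROM page
>         JOIN categorylinks ON cl_from = page_id
>         WHERE page_namespace = 0
>           AND cl_to LIKE '%\_stubs'
>     """
>     with conn.cursor() as cur:
>         cur.execute(query)
>         for page_id, title in cur:
>             print(page_id, title.decode("utf-8"))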
>
>
> Cheers,
> Morten
>
>
> On 20 September 2016 at 02:40, Stuart A. Yeates  wrote:
>>
>> en:WP:DYK has a measure of 1,500+ characters of prose, which is a useful
>> cutoff. There is weaponised javascript to measure that at en:WP:Did you
>> know/DYKcheck.
>>
>> Probably doesn't translate to CJK languages, which have radically different
>> information content per character.
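>>
>> (If a crude server-side approximation helps: the TextExtracts API
>> gives you a plain-text rendering to count. Not quite "prose" in the
>> DYKcheck sense - it keeps lists and captions - but probably fine for
>> a cutoff filter. Sketch only:)
>>
>>     import requests
>>
>>     API = "https://en.wikipedia.org/w/api.php"
>>
>>     def prose_size(title):
>>         """Plain-text length of an article, in characters."""
>>         params = {"action": "query", "format": "json",
>>                   "prop": "extracts", "explaintext": 1, "titles": title}
>>         pages = requests.get(API, params=params).json()["query"]["pages"]
>>         return len(next(iter(pages.values())).get("extract", ""))
>>
>>     print(prose_size("Coffee"))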
>>
>> cheers
>> stuart
>>
>> --
>> ...let us be heard from red core to black sky
>>
>> On Tue, Sep 20, 2016 at 9:26 PM, Robert West  wrote:
>>>
>>> Hi everyone,
>>>
>>> Does anyone know if there's a straightforward (ideally
>>> language-independent) way of identifying stub articles in Wikipedia?
>>>
>>> Whatever works is ok, whether it's publicly available data or data
>>> accessible only on the WMF cluster.
>>>
>>> I've found lists for various languages (e.g., Italian or English), but
>>> the lists are in different formats, so separate code is required for each
>>> language, which doesn't scale.
>>>
>>> I guess in the worst case, I'll have to grep for the respective stub
>>> templates in the respective wikitext dumps, but even this requires knowing,
>>> for each language, what the respective template is. So if anyone could point
>>> me to a list of stub templates in different languages, that would also be
>>> appreciated.
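>>>
>>> (For what it's worth, the grep fallback itself is only a few lines -
>>> the per-language template list really is the hard part. Rough sketch,
>>> assuming enwiki-style "...stub" template names:)
>>>
>>>     import bz2, re, sys
>>>
>>>     title_re = re.compile(r"<title>(.*?)</title>")
>>>     stub_re = re.compile(r"\{\{[^{}|]*stub[^{}]*\}\}", re.I)
>>>
>>>     # stream a pages-articles XML dump, print titles with a stub template
>>>     title, seen = None, False
>>>     with bz2.open(sys.argv[1], "rt", encoding="utf-8") as dump:
>>>         for line in dump:
>>>             m = title_re.search(line)
>>>             if m:
>>>                 title, seen = m.group(1), False
>>>             elif not seen and stub_re.search(line):
>>>                 seen = True
>>>                 print(title)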
>>>
>>> Thanks!
>>> Bob
>>>
>>> --
>>> Up for a little language game? -- http://www.unfun.me
>>>



-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] How to get the exact date when an article gets a quality promotion?

2016-06-10 Thread Andrew Gray
Hi Shiyue,

I agree with Kerry - these ratings probably won't do what you need, in that
case. Sorry!

We simply don't have the people (or the enthusiasm) required to do regular
updates, and I'd guess many ratings are well over five years 'stale' - and
most articles will only ever have been rated once.

There's a second complicating factor for old ratings - not only are they
stale, but the general standards for that rating might have changed. (See,
e.g.,
http://www.generalist.org.uk/blog/2010/quality-versus-age-of-wikipedias-featured-articles/
for a demonstration of that last point - it would be interesting to use
ORES to do a bigger sample.)
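
(Rough sketch of what that ORES pass might look like - the endpoint
and model name are from memory, so check the ORES docs before
trusting it:)

    import requests

    def wp10_predictions(rev_ids):
        """Article-quality predictions for a batch of revision IDs."""
        url = "https://ores.wikimedia.org/v3/scores/enwiki/"
        params = {"models": "wp10",
                  "revids": "|".join(str(r) for r in rev_ids)}
        scores = requests.get(url, params=params).json()["enwiki"]["scores"]
        return {rid: s["wp10"]["score"]["prediction"]
                for rid, s in scores.items()}

    # hypothetical revision IDs:
    print(wp10_predictions([683500179, 727880202]))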

Andrew.
On 10 Jun 2016 07:13, "Shiyue Zhang"  wrote:

> Hi Kerry,
>
> Thanks a lot for your reply! Honestly, I was not aware of the problem you
> mentioned, that many wikiprojects don't do regular quality assessment. This
> problem really matters to me, because I want to get the true quality of a
> revision of an article. I know of Aaron's automated quality assessment
> tool, but it is based on a machine-learning classifier, and automatically
> predicting quality - especially quality change - is precisely my own goal.
> So I can't take the results of that tool as my ground truth.
>
> 2016-06-10 12:16 GMT+08:00 Kerry Raymond :
>
>> If you are not aware of it, many wikiprojects don’t do any kind of
>> regular quality assessment. Often an article is project-tagged and assessed
>> when it’s new (which generally means the quality is assessed stub/start/C)
>> and then it’s never re-assessed unless someone working on it is trying to
>> get it to GA or similar and hence actively requests assessment.
>>
>>
>>
>> So it’s easy for an article to be much better quality (or even much worse
>> quality, although that’s probably less likely) than its current assessment.
>>
>>
>>
>> I think you might do better to use Aaron’s automated quality assessment
>> tool and apply it to different versions of a set of articles and see how
>> that changes over time. Whatever the deficiencies of an automated tool, I
>> suspect it’s still more reliable than the human processes that we actually
>> have. But I guess it depends on whether the focus of your study is the
>> quality of articles or is it the process of assessing the quality of
>> articles? My sense is that you are interested in the former rather than the
>> latter.
>>
>>
>>
>> Kerry
>>
>>
>>
>> *From:* Wiki-research-l [mailto:
>> wiki-research-l-boun...@lists.wikimedia.org] *On Behalf Of *Shiyue Zhang
>> *Sent:* Friday, 10 June 2016 12:42 PM
>> *To:* Research into Wikimedia content and communities <
>> wiki-research-l@lists.wikimedia.org>
>> *Subject:* Re: [Wiki-research-l] How to get the exact date when an
>> article gets a quality promotion?
>>
>>
>>
>> Hi Pine,
>>
>>
>>
>> Thanks for your reply. Yes, it is English Wikipedia. Exactly - I want to
>> get the timestamp of an article's quality rating change. I know
>> the particular diffs shouldn't be considered the reason why the quality
>> rating changed. I'm trying to predict quality change over a
>> certain time period, so I need the start and end quality of the time
>> period.
>>
>>
>>
>> I hope anyone with experience of this problem can give me some
>> advice. Thanks a lot!!!
>>
>>
>>
>> 2016-06-10 9:47 GMT+08:00 Pine W :
>>
>> Hi Zhang,
>>
>> Is this for English Wikipedia?
>>
>> You can probably use automation to find the timestamp of an article's
>> quality rating change on English Wikipedia. Other people on this list
>> probably know how to do this, and they may comment here.
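>>
>> (As a rough sketch, one way is to walk the talk page's revision
>> history and watch the |class= parameter of the project banner change.
>> The regex below is naive - banner templates vary a lot - and pulling
>> full text for every revision is slow, so treat this as a starting
>> point only:)
>>
>>     import re, requests
>>
>>     API = "https://en.wikipedia.org/w/api.php"
>>     CLASS_RE = re.compile(r"\|\s*class\s*=\s*([A-Za-z]+)", re.I)
>>
>>     def rating_changes(article):
>>         """Yield (timestamp, old, new) whenever |class= changes."""
>>         params = {"action": "query", "format": "json",
>>                   "prop": "revisions", "titles": "Talk:" + article,
>>                   "rvprop": "timestamp|content", "rvlimit": "50",
>>                   "rvdir": "newer"}
>>         last = None
>>         while True:
>>             data = requests.get(API, params=params).json()
>>             page = next(iter(data["query"]["pages"].values()))
>>             for rev in page.get("revisions", []):
>>                 m = CLASS_RE.search(rev.get("*", ""))
>>                 cls = m.group(1).upper() if m else None
>>                 if cls != last:
>>                     yield rev["timestamp"], last, cls
>>                     last = cls
>>             if "continue" not in data:
>>                 break
>>             params.update(data["continue"])
>>
>>     for ts, old, new in rating_changes("Coffee"):
>>         print(ts, old, "->", new)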
>>
>> However, that does not imply that any particular diffs should be
>> considered to have a quality that is equivalent to the quality of the
>> article. Measuring the quality of diffs is an inexact science, but you
>> might want to take a look at Revision Scoring. Aaron Halfaker can tell you
>> more about how useful, or not, Revision Scoring is for measuring the
>> quality of diffs. Hopefully he will respond to this email.
>>
>> Pine
>>
>> On Jun 9, 2016 18:29, "Shiyue Zhang"  wrote:
>>
>> Hi,
>>
>>
>>
>> I'm doing research on Wikipedia article quality, and I take advantage of
>> WikiProject Assessments. But I can only get the latest quality level of an
>> article. I wonder how to get the quality of each revision, or how to get
>> the exact date when an article gets a quality promotion, for example, from
>> A-class to FA-class.
>>
>>
>>
>> I really need your help! Thanks!
>>
>>
>>
>> Zhang Shiyue
>>
>>
>>
>> --
>>
>> Zhang Shiyue
>>
>> *Tel*: +86 18801167900
>>
>> *E-mail*: byry...@gmail.com, yuer3...@163.com
>>
>> State Key Laboratory of Networking and Switching Technology
>>
>> No.10 Xitucheng Road, Haidian District
>>
>> Beijing University of Posts and Telecommunications
>>
>> Beijing, China.
>>
>>
>>
>> ___
>> Wiki-research-l mailing list
>> Wiki-research-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>

Re: [Wiki-research-l] unique visitors

2016-03-19 Thread Andrew Gray
On 17 March 2016 at 19:40, phoebe ayers  wrote:

>> One of the drawbacks is that we
>> can't report on a single total number across all our projects.
>
> Hmm. That's unfortunate for understanding reach -- if nothing else,
> the idea that "half a billion people access Wikipedia" (e.g. from
> earlier comscore reports) was a PR-friendly way of giving an idea of
> the scale of our readership. But I can see why it would be tricky to
> measure. Since this is the research list: I suspect there's still lots
> to be done in understanding just how multilingual people use different
> language editions of Wikipedia, too.

Building on this question a little: with the information we currently
have, is it actively *wrong* for us to keep using the "half a billion"
figure as a very rough first-order estimate? (Like Phoebe, I think I
keep trotting it out when giving talks.) Do the new figures give us
reason to think it's substantially higher or lower than that - or even
that the question isn't meaningfully answerable?

-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Looking for help finding tools to measure UNESCO project

2015-10-06 Thread Andrew Gray
On 6 October 2015 at 14:12, Amir E. Aharoni
 wrote:
> Thanks for this email.
>
> This raises a wider question: What is the comfortable way to compare the
> coverage of a topic in different languages?
>
> For example, I'd love to see a report that says:
>
> Number of articles about UNESCO cultural heritage:
> English Wikipedia: 1000
> French Wikipedia: 1200
> Hebrew Wikipedia: 742
> etc.
>
> And also to track this over time, so if somebody were to work hard on creating
> articles about UNESCO cultural heritage in Hebrew, I'd see a trend graph.

There are two general approaches to this:

a) On Wikidata
b) On the individual wikis

Approach (a) would rely on having a defined set of things in Wikidata
that we can identify. For example, "is a World Heritage Site" would be
easy enough, since we have a property explicitly dealing with WHS
identifiers (and we have 100% coverage in Wikidata). "Is of interest
to UNESCO" is a trickier one - but if you can construct a suitable
Wikidata query...
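
(Sketch of what that looks like for the WHS case - the property
number is from memory, so verify it really is the World Heritage Site
identifier before relying on it:)

    import requests

    SPARQL = """
    SELECT ?wiki (COUNT(DISTINCT ?item) AS ?articles) WHERE {
      ?item wdt:P757 [] .                 # has a WHS identifier
      ?article schema:about ?item ;
               schema:isPartOf ?wiki .
    }
    GROUP BY ?wiki ORDER BY DESC(?articles)
    """

    r = requests.get("https://query.wikidata.org/sparql",
                     params={"query": SPARQL, "format": "json"})
    for row in r.json()["results"]["bindings"]:
        print(row["articles"]["value"], row["wiki"]["value"])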

As Federico notes, for WHS records, we can generate a report like
https://tools.wmflabs.org/mix-n-match/?mode=sitestats&catalog=93
(57.4% coverage on hewiki!). No graphs, but if you were interested you
could probably set one up without much work.

b) is more useful for fuzzy groups like "of relevance to UNESCO",
since this is more or less perfect for a category system. However, it
would require examining the category tree for each WP you're
interested in to figure out exactly which categories are relevant, and
then running a script to count those daily.
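
(A minimal version of that counting script, via the API - the
recursion depth and the root category for each wiki are the manual
part; "Category:World Heritage Sites" below is just an enwiki
example:)

    import requests

    def articles_under(api, root, depth=2, seen=None):
        """Titles of mainspace pages in `root` and its subcategories."""
        seen = set() if seen is None else seen
        pages = set()
        params = {"action": "query", "format": "json",
                  "list": "categorymembers", "cmtitle": root,
                  "cmtype": "page|subcat", "cmlimit": "500"}
        while True:
            data = requests.get(api, params=params).json()
            for m in data["query"]["categorymembers"]:
                if m["ns"] == 0:
                    pages.add(m["title"])
                elif m["ns"] == 14 and depth > 0 and m["title"] not in seen:
                    seen.add(m["title"])
                    pages |= articles_under(api, m["title"], depth - 1, seen)
            if "continue" not in data:
                break
            params.update(data["continue"])
        return pages

    print(len(articles_under("https://en.wikipedia.org/w/api.php",
                             "Category:World Heritage Sites")))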

A.
-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] [Spam] Re: citations to articles cited on wikipedia?

2015-08-21 Thread Andrew Gray
They did; DOAJ seems to have been the method used to determine whether
a journal was OA or not (which is fair enough).

Andrew.

On 21 August 2015 at 12:50, Federico Leva (Nemo)  wrote:
> Andrew Gray, 20/08/2015 14:21:
>>
>> They worked on a journal basis, classing them as "OA" or "not OA".
>
>
> Weird, why didn't they just use DOAJ? https://doaj.org/
>
> Nemo
>
>
> ___
> Wiki-research-l mailing list
> Wiki-research-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l



-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] citations to articles cited on wikipedia?

2015-08-20 Thread Andrew Gray
On 20 August 2015 at 06:54, Jane Darnell  wrote:
> ..."the odds that an open access journal is referenced on
> the English Wikipedia are 47% higher compared to closed access"
>
> Thanks for posting! That's an interesting paper, for all sorts of reasons. I
> read it because I highly doubt that the number is as low as that. There is

I've been meaning to actually go through this paper for a while, and
finally did so this morning :-).

They worked on a journal basis, classing them as "OA" or "not OA". But
this is, in some ways, a very small sample. See, e.g.,
http://science-metrix.com/files/science-metrix/publications/d_1.8_sm_ec_dg-rtd_proportion_oa_1996-2013_v11p.pdf,
which suggests that articles in gold OA titles represent less than 15%
of the total amount of material "freely available" in various forms.

Given this limitation, it seems quite plausible that the actual
OA:citation correlation is higher on a *per-paper* basis... we just
don't really have the information to be sure.

-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Wikilink referral statistics

2015-04-30 Thread Andrew Gray
Hi Aaron,

You're quite right - should have read Dario's old email in a bit more
detail! Apologies, Ellery...

I'm very curious to see the results of the see-also study, and it
strikes me that we could also use this to get some idea of reading
persistence - how much more likely is it that a (unique) link in the
first half of the page is followed versus one in the second half.

Andrew.

On 29 April 2015 at 15:54, Aaron Halfaker  wrote:
> Indeed Andrew.  Upon re-reading, I think you're right.  Thanks for pointing
> to that dataset.
>
> Also, primary credit for that dataset should go to Ellery Wulczyn.  :)
> Credit where it is due.
>
> Wulczyn, Ellery; Taraborelli, Dario (2015): Wikipedia Clickstream. figshare.
> http://dx.doi.org/10.6084/m9.figshare.1305770
>
> -Aaron
>
>
> On Wed, Apr 29, 2015 at 9:47 AM, Andrew Gray 
> wrote:
>>
>> Hi Aaron,
>>
>> I may be misreading the request but I think what's being looked at
>> here is Wikipedia -> Wikipedia links - so the referring server + the
>> referred server are both ours.
>>
>> Given that, I *think* this data Dario put out earlier in the year
>> would be what's needed - http://dx.doi.org/10.6084/m9.figshare.1305770
>> - but with the caveat that it's only enwiki and only for two months.
>> It won't identify which link on a page was used (if it appears
>> multiple times), but most "see also" links are unique within the page
>> and so this shouldn't pose a problem.
>>
>> Andrew.
>>
>>
>> On 29 April 2015 at 14:47, Aaron Halfaker 
>> wrote:
>> > Hi Physikerwelt,
>> >
>> > I'm not sure how we'd collect that data.  You'd need to gather it from
>> > whatever server the user's browser made a request to after clicking one
>> > of
>> > those links.  That's how referrers work.  Also, clicks to non-https
>> > links
>> > from https Wikipedia will not contain referrers.  See
>> > https://meta.wikimedia.org/wiki/Research:Wikimedia_referrer_policy for a
>> > proposal to update our policy.
>> >
>> > -Aaron
>> >
>> > On Wed, Apr 29, 2015 at 6:33 AM, Physikerwelt 
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> is there information about referrals within enwiki?
>> >> We are investigating the quality of the "See also" links and are
>> >> looking for estimates of how often the "see also" links were used.
>> >> If so, can we access the information from eqiad.wmflabs?
>> >>
>> >> Best
>> >> Physikerwelt
>> >>
>>
>>
>>
>> --
>> - Andrew Gray
>>   andrew.g...@dunelm.org.uk
>>



-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Wikilink referral statistics

2015-04-29 Thread Andrew Gray
Hi Aaron,

I may be misreading the request but I think what's being looked at
here is Wikipedia -> Wikipedia links - so the referring server + the
referred server are both ours.

Given that, I *think* this data Dario put out earlier in the year
would be what's needed - http://dx.doi.org/10.6084/m9.figshare.1305770
- but with the caveat that it's only enwiki and only for two months.
It won't identify which link on a page was used (if it appears
multiple times), but most "see also" links are unique within the page
and so this shouldn't pose a problem.
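
(And a quick sketch of reading that dump, if useful - the column
names are from memory, so check the README on figshare first:)

    import csv, sys

    def follows(path, source):
        """How often each link out of `source` was actually clicked."""
        counts = {}
        with open(path, encoding="utf-8") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                if row["prev_title"] == source and row["type"] == "link":
                    counts[row["curr_title"]] = int(row["n"])
        return counts

    for title, n in sorted(follows(sys.argv[1], "Coffee").items(),
                           key=lambda kv: -kv[1]):
        print(n, title)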

Andrew.


On 29 April 2015 at 14:47, Aaron Halfaker  wrote:
> Hi Physikerwelt,
>
> I'm not sure how we'd collect that data.  You'd need to gather it from
> whatever server the user's browser made a request to after clicking one of
> those links.  That's how referrers work.  Also, clicks to non-https links
> from https Wikipedia will not contain referrers.  See
> https://meta.wikimedia.org/wiki/Research:Wikimedia_referrer_policy for a
> proposal to update our policy.
>
> -Aaron
>
> On Wed, Apr 29, 2015 at 6:33 AM, Physikerwelt  wrote:
>>
>> Hi,
>>
>> is there information about referrals within enwiki?
>> We are investigating the quality of the "See also" links and are looking
>> for estimates of how often the "see also" links were used.
>> If so, can we access the information from eqiad.wmflabs?
>>
>> Best
>> Physikerwelt
>>



-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] (no subject)

2015-03-17 Thread Andrew Gray
I've noted Finland (as a country) before when looking at Erik's data -
IIRC, there's a vaguely normal-looking distribution of
pages-per-internet-user-per-month for the Western European countries,
and Finland is at the upper end but not a dramatic outlier; it's in a
group with e.g. Sweden, Austria, etc.

http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerCountryOverview.htm

This pattern has been around since at least 2012:

http://web.archive.org/web/20120922063053/http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerCountryOverview.htm

(not sure why the 2012 per-country numbers are so much higher...)

Andrew.

On 16 March 2015 at 09:30, Oliver Keyes  wrote:
> Awesome work! It's interesting to see Finnish as the outlier here. Do
> we have any fi-users on the list who can comment on this and might
> know what's going on? (And, in the absence of Finns: Jan, heard
> anything from across the border? :p)
>
> The only caution I'd raise is that these numbers don't include spider
> filtering. Why is this important? Well, a lot of traffic is driven by
> crawlers and spiders and automata, particularly on smaller projects,
> and it can lead to weirdness as a result. With the granular pagecount
> files there's some work that can be done to detect this (for example,
> using burst detection and a few heuristics around concentration
> measures to eliminate pages that are clearly driven by automated
> traffic - see the recent analytics mailing list thread) but only some.
> I appreciate this is a flaw in the data we are releasing, not in your
> work, which is an excellent read and highly interesting :). I agree
> that understanding the lack of development in the PRC and ROK is
> crucial - we keep talking about the "next billion readers" but only
> talking :(
>
> On 16 March 2015 at 02:21, h  wrote:
>> Dear all,
>>
>> I have some findings showing that the page-views-per-Internet-user
>> measurement may help in comparing different language editions of Wikipedia.
>> Criticism and suggestions are welcome.
>>
>>
>> -
>> http://people.oii.ox.ac.uk/hanteng/2015/03/15/comparing-language-development-in-wikipedia-in-terms-of-page-views-per-internet-users/
>>
>> Which language version of Wikipedia enjoys more page views per language
>> Internet user than expected? Finnish. In terms of absolute positive
>> and negative gaps, English has the widest positive gap whereas Chinese has
>> the largest negative gap.
>>
>> ..
>>
>> In particular, it is known that Wikipedia (and Google, which often favours
>> Wikipedia) faces local competition in the People's Republic of China and
>> South Korea. Therefore it is understandable that page views may be lower in
>> the Chinese and Korean Wikipedia projects, simply because some users' need
>> to read user-generated encyclopedias is satisfied by other websites.
>> However, it remains an important question why these particular Latin and
>> Asian languages are under-developed in terms of Wikipedia projects.
>>
>> ___
>> Wiki-research-l mailing list
>> Wiki-research-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>
>
>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>
> ___
> Wiki-research-l mailing list
> Wiki-research-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l



-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] preelminary results from the Wikipedia Gender Inequality Index project - comments welcome

2015-01-16 Thread Andrew Gray
 the project, see
>>>>> https://meta.wikimedia.org/wiki/Research:Wikipedia_Gender_Inequality_Index
>>>>> and its talk page). We are very curious what you think (don't hesitate to
>>>>> be critical). What we would really appreciate would be any alternative
>>>>> hypotheses (to the one presented) that could try to explain why post-1950s
>>>>> Confucian and South Asian clusters seem so much more inclusive of female
>>>>> biographies than others (including the "Western" clusters). Are we seeing 
>>>>> a
>>>>> data error, or something else - and if so, what?
>>>>>
>>>>> --
>>>>> Piotr Konieczny, PhD
>>>>> http://hanyang.academia.edu/PiotrKonieczny
>>>>> http://scholar.google.com/citations?user=gdV8_AEJ
>>>>> http://en.wikipedia.org/wiki/User:Piotrus



-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] How many links did TWL account recipients add to Wikipedia with their access?

2015-01-15 Thread Andrew Gray
One approach might be to get a list of all pages/links added using
Mark's method, then for each of the sets manually sample a few
percent of the new links and see who added them - this would let you
know whether you're looking at a situation where "almost all links to XYZ
were added by TWL users" or "only some links were added by TWL
users", and estimate accordingly.

(I suspect some will definitely be in the first batch)
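
(Sketch of the dump-diff step, for what it's worth - the VALUES
parsing is crude and the column order is from memory, so check the
CREATE TABLE statement at the top of the dump; jstor.org is just an
example TWL-partner domain:)

    import gzip, re, sys

    # crude row matcher: el_from, then a quoted el_to
    ROW = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)'")

    def links(path, domain):
        found = set()
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                if line.startswith("INSERT INTO"):
                    for page_id, url in ROW.findall(line):
                        if domain in url:
                            found.add((int(page_id), url))
        return found

    before = links(sys.argv[1], "jstor.org")   # older dump
    after = links(sys.argv[2], "jstor.org")    # newer dump
    print(len(after - before), "added;", len(before - after), "removed")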

Andrew.

On 14 January 2015 at 13:32, mjn  wrote:
>
> Aaron Halfaker  writes:
>> ...you'll need to parse wiki content in order to extract external links.
>> I don't think they are stored in a table anywhere.
>
> The links themselves are, but it isn't tied to editor information, so I
> don't think it will answer this particular query. In the database dumps at
> dumps.wikimedia.org, the table that's dumped as
> xxwiki-mmdd-externallinks.sql.gz lists external links per-page. So
> if you just wanted counts of link additions (or removals), you could
> grab two dumps from different dates and compare.  But you'll need to
> parse the full revision information to get a count of who added which
> links.
>
> -Mark
>
> --
> mjn | http://www.anadrome.org
>
> ___
> Wiki-research-l mailing list
> Wiki-research-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l



-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] [Analytics] Geo-aggregation of Wikipedia page views: Maximizing geographic granularity while preserving privacy – a proposal

2015-01-13 Thread Andrew Gray
Fair enough - I don't use it, and I think I'd got entirely the wrong
end of the stick on what it's for! If it's intended to stop tracking
by third-party sites then it certainly seems to be of little relevance
here.

(It might be worth clarifying this in the proposal, in case a future
ethics-committee reviewer gets the same misapprehension?)

Andrew.

On 13 January 2015 at 20:24, Aaron Halfaker  wrote:
> Andrew,
>
> I think it is reasonable to assume that the "Do not track" header isn't
> referring to this.
>
> From http://donottrack.us/ with emphasis added.
>>
>> Do Not Track is a technology and policy proposal that enables users to opt
>> out of tracking by websites they do not visit, [...]
>
>
> Do not track is explicitly for third party tracking.  We are merely
> proposing to count those people who do access our sites.  Note that, in this
> case, we are not interested in obtaining identifiers at all, so the word
> "track" seems to not apply.
>
> It seems like we're looking for something like a "Do Not Log Anything At
> All" header.  I don't believe that such a thing exists -- but if it did I
> think it would be good if we supported it.
>
> -Aaron
>
> On Tue, Jan 13, 2015 at 2:03 PM, Andrew Gray 
> wrote:
>>
>> Hi Dario, Reid,
>>
>> This seems sensible enough and proposal #3 is clearly the better
>> approach. An explicit opt-in/opt-out mechanism would not be worth the
>> effort to build and would become yet another ignored preferences
>> setting after a few weeks...
>>
>> A couple of thoughts:
>>
>> * I understand the reasoning for not using do-not-track headers (#4);
>> however, it feels a bit odd to say "they probably don't mean us" and
>> skip them... I can almost guarantee you'll have at least one person
>> making a vocal fuss about not being able to opt-out without an
>> account. If we were to honour these headers, would it make a
>> significant change to the amount of data available? Would it likely
>> skew it any more than leaving off logged-in users?
>>
>> * Option 3 does release one further piece of information over and
>> above those listed - an approximate ratio of logged-in versus
>> non-logged-in pageviews for a page. I cannot see any particular
>> problem with doing this (and I can think of a couple of fun things to
>> use it for) but it's probably worth being aware.
>>
>> Andrew.
>>
>> On 13 January 2015 at 07:26, Dario Taraborelli
>>  wrote:
>> > I’m sharing a proposal that Reid Priedhorsky and his collaborators at
>> > Los Alamos National Laboratory recently submitted to the Wikimedia 
>> > Analytics
>> > Team aimed at producing privacy-preserving geo-aggregates of Wikipedia
>> > pageview data dumps and making them available to the public and the 
>> > research
>> > community. [1]
>> >
>> > Reid and his team spearheaded the use of the public Wikipedia pageview
>> > dumps to monitor and forecast the spread of influenza and other diseases,
>> > using language as a proxy for location. This proposal describes an
>> > aggregation strategy adding a geographical dimension to the existing dumps.
>> >
>> > Feedback on the proposal is welcome on the lists or the project talk
>> > page on Meta [3]
>> >
>> > Dario
>> >
>> > [1]
>> > https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pageviews
>> > [2] http://dx.doi.org/10.1371/journal.pcbi.1003892
>> > [3]
>> > https://meta.wikimedia.org/wiki/Research_talk:Geo-aggregation_of_Wikipedia_pageviews
>> > ___
>> > Analytics mailing list
>> > analyt...@lists.wikimedia.org
>> > https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>>
>> --
>> - Andrew Gray
>>   andrew.g...@dunelm.org.uk
>>



-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] [Analytics] Geo-aggregation of Wikipedia page views: Maximizing geographic granularity while preserving privacy – a proposal

2015-01-13 Thread Andrew Gray
Hi Dario, Reid,

This seems sensible enough and proposal #3 is clearly the better
approach. An explicit opt-in/opt-out mechanism would not be worth the
effort to build and would become yet another ignored preferences
setting after a few weeks...

A couple of thoughts:

* I understand the reasoning for not using do-not-track headers (#4);
however, it feels a bit odd to say "they probably don't mean us" and
skip them... I can almost guarantee you'll have at least one person
making a vocal fuss about not being able to opt-out without an
account. If we were to honour these headers, would it make a
significant change to the amount of data available? Would it likely
skew it any more than leaving off logged-in users?

* Option 3 does release one further piece of information over and
above those listed - an approximate ratio of logged-in versus
non-logged-in pageviews for a page. I cannot see any particular
problem with doing this (and I can think of a couple of fun things to
use it for) but it's probably worth being aware.

Andrew.

On 13 January 2015 at 07:26, Dario Taraborelli
 wrote:
> I’m sharing a proposal that Reid Priedhorsky and his collaborators at Los 
> Alamos National Laboratory recently submitted to the Wikimedia Analytics Team 
> aimed at producing privacy-preserving geo-aggregates of Wikipedia pageview 
> data dumps and making them available to the public and the research 
> community. [1]
>
> Reid and his team spearheaded the use of the public Wikipedia pageview dumps 
> to monitor and forecast the spread of influenza and other diseases, using 
> language as a proxy for location. This proposal describes an aggregation 
> strategy adding a geographical dimension to the existing dumps.
>
> Feedback on the proposal is welcome on the lists or the project talk page on 
> Meta [3]
>
> Dario
>
> [1] 
> https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pageviews
> [2] http://dx.doi.org/10.1371/journal.pcbi.1003892
> [3] 
> https://meta.wikimedia.org/wiki/Research_talk:Geo-aggregation_of_Wikipedia_pageviews
> ___
> Analytics mailing list
> analyt...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics



-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l