Re: [Wiki-research-l] Identifying Wikipedia stubs in various languages
Hi all, I'd strongly caution against using the stub categories without *also* doing some kind of filtering on size. There's a real problem with "stub lag" - articles get tagged, incrementally improve, no-one thinks they've done enough to justify removing the tag (or notices the tag is there, or thinks they're allowed to remove it)... and you end up with a lot of multi-section pages with a good hundred words of text still labelled "stub". (Talkpage ratings are even worse for this, but that's another issue.) Andrew. On 20 September 2016 at 18:01, Morten Wang wrote: > I don't know of a clean, language-independent way of grabbing all stubs. > Stuart's suggestion is quite sensible, at least for English Wikipedia. When > I last checked a few years ago, the mean length of an English language stub > (on a log-scale) was around 1kB (including all markup), and they're much > smaller than any other class. > > I'd also see if the category system allows for some straightforward > retrieval. English has > https://en.wikipedia.org/wiki/Category:Stub_categories and > https://en.wikipedia.org/wiki/Category:Stubs with quite a lot of links to > other languages, which could be a good starting point. For some of the > research we've done on quality, exploiting regularities in the category > system using database access (in other words, LIKE-queries) is a quick way > to grab most articles. > > A combination of both approaches might be a good way. If you're looking for > even more thorough classification, grabbing a set and training a classifier > might be the way to go. > > > Cheers, > Morten > > > On 20 September 2016 at 02:40, Stuart A. Yeates wrote: >> >> en:WP:DYK has a measure of 1,500+ characters of prose, which is a useful >> cutoff. There is weaponised javascript to measure that at en:WP:Did you >> know/DYKcheck >> >> Probably doesn't translate to CJK languages, which have radically different >> information content per character. 
>> >> cheers >> stuart >> >> -- >> ...let us be heard from red core to black sky >> >> On Tue, Sep 20, 2016 at 9:26 PM, Robert West wrote: >>> >>> Hi everyone, >>> >>> Does anyone know if there's a straightforward (ideally >>> language-independent) way of identifying stub articles in Wikipedia? >>> >>> Whatever works is ok, whether it's publicly available data or data >>> accessible only on the WMF cluster. >>> >>> I've found lists for various languages (e.g., Italian or English), but >>> the lists are in different formats, so separate code is required for each >>> language, which doesn't scale. >>> >>> I guess in the worst case, I'll have to grep for the respective stub >>> templates in the respective wikitext dumps, but even this requires knowing >>> for each language what the respective template is. So if anyone could point >>> me to a list of stub templates in different languages, that would also be >>> appreciated. >>> >>> Thanks! >>> Bob >>> >>> -- >>> Up for a little language game? -- http://www.unfun.me -- - Andrew Gray andrew.g...@dunelm.org.uk ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
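The two signals discussed in this thread, a stub tag plus a raw size cutoff, can be combined in a few lines. A minimal sketch (page data invented for illustration; the 1,500-character figure is the en:WP:DYK prose threshold Stuart mentions, so it would need adjusting for CJK wikis):

```python
# Treat a page as a stub only if it is both tagged and still short.
# The size filter catches "stub lag": pages that grew well past the
# threshold but whose stub tag was never removed.

DYK_PROSE_CUTOFF = 1500  # characters of prose, per en:WP:DYK

def is_probable_stub(has_stub_tag, prose_chars, cutoff=DYK_PROSE_CUTOFF):
    """True only for pages that carry a stub tag AND are short."""
    return has_stub_tag and prose_chars < cutoff

# Toy data: (has_stub_tag, prose_chars) per page, values invented.
pages = {
    "Tagged, 300 chars": (True, 300),     # genuine stub
    "Tagged, 4000 chars": (True, 4000),   # stub lag: multi-section, still tagged
    "Untagged, 200 chars": (False, 200),  # short, but never tagged
}
stubs = [t for t, (tag, size) in pages.items() if is_probable_stub(tag, size)]
print(stubs)  # -> ['Tagged, 300 chars']
```

The same predicate works whichever way the tag was found (category membership, template grep, or a LIKE-query on the category links table).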
Re: [Wiki-research-l] How to get the exact date when an article get a quality promotion?
Hi Shiyue, I agree with Kerry - these ratings probably won't do what you need, in that case. Sorry! We simply don't have the people (or the enthusiasm) required to do regular updates and I'd guess many are well over five years 'stale' since last rating - and most will only ever have been rated once. There's a second complicating factor for old ratings - not only are they stale, but the general standards for that rating might have changed. (See e.g. http://www.generalist.org.uk/blog/2010/quality-versus-age-of-wikipedias-featured-articles/ for a demonstration of that last point - it would be interesting to use ORES to do a bigger sample.) Andrew. On 10 Jun 2016 07:13, "Shiyue Zhang" wrote: > Hi Kerry, > > Thanks a lot for your reply! Honestly, I was not aware of the problem you > mentioned, that many wikiprojects don't do regular quality assessment. This > problem really matters to me, because I want to get the relatively true > quality of a revision of an article. I know Aaron's automated quality > assessment tool, but it is based on a machine learning classifier, and > automatically predicting quality, especially quality change, is my own goal > too. So I can't take the results of this tool as my ground truth. > > 2016-06-10 12:16 GMT+08:00 Kerry Raymond : > >> If you are not aware of it, many wikiprojects don’t do any kind of >> regular quality assessment. Often an article is project-tagged and assessed >> when it’s new (which generally means the quality is assessed stub/start/C) >> and then it’s never re-assessed unless someone working on it is trying to >> get it to GA or similar and hence actively requests assessment. >> >> >> >> So it’s easy for an article to be much better quality (or even much worse >> quality, although that’s probably less likely) than its current assessment. >> >> >> >> I think you might do better to use Aaron’s automated quality assessment >> tool and apply it to different versions of a set of articles and see how >> that changes over time. 
Whatever the deficiencies of an automated tool, I >> suspect it’s still more reliable than the human processes that we actually >> have. But I guess it depends on whether the focus of your study is the >> quality of articles or the process of assessing the quality of >> articles? My sense is that you are interested in the former rather than the >> latter. >> >> >> >> Kerry >> >> >> >> *From:* Wiki-research-l [mailto: >> wiki-research-l-boun...@lists.wikimedia.org] *On Behalf Of *Shiyue Zhang >> *Sent:* Friday, 10 June 2016 12:42 PM >> *To:* Research into Wikimedia content and communities < >> wiki-research-l@lists.wikimedia.org> >> *Subject:* Re: [Wiki-research-l] How to get the exact date when an >> article get a quality promotion? >> >> >> >> Hi Pine, >> >> >> >> Thanks for your reply. Yes, it is English Wikipedia. To be exact, I want to >> get the timestamp of an article's quality rating change. I know >> that particular diffs shouldn't be considered as the reason why the quality >> rating changed. I'm trying to get a prediction of quality change beyond a >> certain time period, so I need the start and end quality of the time >> period. >> >> >> >> I hope anyone who has experience with this problem can give me some >> advice. Thanks a lot!!! >> >> >> >> 2016-06-10 9:47 GMT+08:00 Pine W : >> >> Hi Zhang, >> >> Is this for English Wikipedia? >> >> You can probably use automation to find the timestamp of an article's >> quality rating change on English Wikipedia. Other people on this list >> probably know how to do this, and they may comment here. >> >> However, that does not imply that any particular diffs should be >> considered to have a quality that is equivalent to the quality of the >> article. Measuring the quality of diffs is an inexact science, but you >> might want to take a look at Revision Scoring. Aaron Halfaker can tell you >> more about how useful, or not, Revision Scoring is for measuring the >> quality of diffs. Hopefully he will respond to this email. 
>> >> Pine >> >> On Jun 9, 2016 18:29, "Shiyue Zhang" wrote: >> >> Hi, >> >> >> >> I'm doing research on Wikipedia article quality, and I take advantage of >> WikiProject Assessments. But I can only get the latest quality level of an >> article. I wonder how to get the quality of each revision, or how to get >> the exact date when an article gets a quality promotion, for example, from >> A-class to FA-class. >> >> >> >> I really need your help! Thanks! >> >> >> >> Zhang Shiyue >> >> >> >> -- >> >> Zhang Shiyue >> >> *Tel*: +86 18801167900 >> >> *E-mail*: byry...@gmail.com, yuer3...@163.com >> >> State Key Laboratory of Networking and Switching Technology >> >> No.10 Xitucheng Road, Haidian District >> >> Beijing University of Posts and Telecommunications >> >> Beijing, China. >> >> >> >> ___ >> Wiki-research-l mailing list >> Wiki-research-l@lists.wikimedia.org >> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
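The revision-walking approach implied in this thread can be sketched directly: the assessment lives in a WikiProject banner on the article's *talk* page, so walking the talk page's revision history and watching the |class= parameter yields the date of each promotion or demotion. The revisions below are invented; real ones would come from the MediaWiki API (prop=revisions, with content) or the dumps, and a page with several disagreeing banners would need more care than this first-match version:

```python
import re

# Matches the |class= parameter of a WikiProject banner, e.g.
# {{WikiProject Physics|class=B|importance=High}} -> "B".
CLASS_RE = re.compile(r"\|\s*class\s*=\s*([A-Za-z]+)")

def rating_changes(revisions):
    """revisions: (timestamp, talk_page_wikitext) pairs, oldest first.
    Returns one (timestamp, old_class, new_class) tuple per change."""
    changes, current = [], None
    for ts, text in revisions:
        m = CLASS_RE.search(text)
        if not m:
            continue  # no assessment banner in this revision
        found = m.group(1)
        if found != current:
            changes.append((ts, current, found))
            current = found
    return changes

# Invented talk-page history for illustration.
history = [
    ("2014-02-01", "{{WikiProject Physics|class=Start|importance=High}}"),
    ("2015-06-10", "{{WikiProject Physics|class=B|importance=High}}"),
    ("2016-01-05", "{{WikiProject Physics|class=GA|importance=High}}"),
]
print(rating_changes(history))
```

Note this only recovers when the *recorded* rating changed, which, per Kerry's and Andrew's caveats, can lag far behind the article's actual quality.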
Re: [Wiki-research-l] unique visitors
On 17 March 2016 at 19:40, phoebe ayers wrote: >> One of the drawbacks is that we >> can't report on a single total number across all our projects. > > Hmm. That's unfortunate for understanding reach -- if nothing else, > the idea that "half a billion people access Wikipedia" (eg from > earlier comscore reports) was a PR-friendly way of giving an idea of > the scale of our readership. But I can see why it would be tricky to > measure. Since this is the research list: I suspect there's still lots > to be done in understanding just how multilingual people use different > language editions of Wikipedia, too. Building on this question a little: with the information we currently have, is it actively *wrong* for us to keep using the "half a billion" figure as a very rough first-order estimate? (Like Phoebe, I think I keep trotting it out when giving talks). Do the new figures give us reason to think it's substantially higher or lower than that, or even not meaningfully answerable? -- - Andrew Gray andrew.g...@dunelm.org.uk ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Looking for help finding tools to measure UNESCO project
On 6 October 2015 at 14:12, Amir E. Aharoni wrote: > Thanks for this email. > > This raises a wider question: What is a comfortable way to compare the > coverage of a topic in different languages? > > For example, I'd love to see a report that says: > > Number of articles about UNESCO cultural heritage: > English Wikipedia: 1000 > French Wikipedia: 1200 > Hebrew Wikipedia: 742 > etc. > > And also to track this over time, so if somebody would work hard on creating > articles about UNESCO cultural heritage in Hebrew, I'd see a trend graph. There are two general approaches to this: a) On Wikidata b) On the individual wikis Approach (a) would rely on having a defined set of things in Wikidata that we can identify. For example, "is a World Heritage Site" would be easy enough, since we have a property explicitly dealing with WHS identifiers (and we have 100% coverage in Wikidata). "Is of interest to UNESCO" is a trickier one - but if you can construct a suitable Wikidata query... As Federico notes, for WHS records, we can generate a report like https://tools.wmflabs.org/mix-n-match/?mode=sitestats&catalog=93 (57.4% coverage on hewiki!). No graphs, but if you were interested then you could probably set one up without much work. Approach (b) is more useful for fuzzy groups like "of relevance to UNESCO", since this is more or less perfect for a category system. However, it would require examining the category tree for each WP you're interested in to figure out exactly which categories are relevant, and then running a script to count those daily. A. -- - Andrew Gray andrew.g...@dunelm.org.uk ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
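Once approach (a) has produced the set of Wikidata items for the topic (say, everything carrying a World Heritage Site identifier), per-wiki coverage is just a matter of counting sitelinks. A toy sketch, with invented item IDs and sitelink sets standing in for a real query result:

```python
from collections import Counter

# Toy stand-in for "all Wikidata items with a WHS identifier":
# each entry lists which wikis have an article on that item.
items = [
    {"item": "Q-example-1", "sitelinks": {"enwiki", "frwiki", "hewiki"}},
    {"item": "Q-example-2", "sitelinks": {"enwiki", "frwiki"}},
    {"item": "Q-example-3", "sitelinks": {"enwiki"}},
]

coverage = Counter()
for it in items:
    coverage.update(it["sitelinks"])  # one count per wiki per item covered

total = len(items)
for wiki, n in coverage.most_common():
    print(f"{wiki}: {n}/{total} articles ({100 * n / total:.1f}%)")
```

Running this daily and storing the counts would give Amir his trend graph, which is essentially what the mix-n-match sitestats report does for WHS records.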
Re: [Wiki-research-l] [Spam] Re: citations to articles cited on wikipedia?
They did; DOAJ seems to have been the method used to determine whether a journal was OA or not (which is fair enough). Andrew. On 21 August 2015 at 12:50, Federico Leva (Nemo) wrote: > Andrew Gray, 20/08/2015 14:21: >> >> They worked on a journal basis, classing them as "OA" or "not OA". > > > Weird, why didn't they just use DOAJ? https://doaj.org/ > > Nemo > > > ___ > Wiki-research-l mailing list > Wiki-research-l@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- - Andrew Gray andrew.g...@dunelm.org.uk ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] citations to articles cited on wikipedia?
On 20 August 2015 at 06:54, Jane Darnell wrote: > ..."the odds that an open access journal is referenced on > the English Wikipedia are 47% higher compared to closed access" > > Thanks for posting! That's an interesting paper, for all sorts of reasons. I > read it because I highly doubt that the number is as low as that. There is I've been meaning to actually go through this paper for a while, and finally did so this morning :-). They worked on a journal basis, classing them as "OA" or "not OA". But this is, in some ways, a very small sample. See, e.g., http://science-metrix.com/files/science-metrix/publications/d_1.8_sm_ec_dg-rtd_proportion_oa_1996-2013_v11p.pdf, which suggests that articles in gold OA titles represent less than 15% of the total amount "freely available" through various forms. Given this limitation, it seems quite plausible that the actual OA:citation correlation is higher on a *per-paper* basis... we just don't really have the information to be sure. -- - Andrew Gray andrew.g...@dunelm.org.uk ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] Wikilink referral statistics
Hi Aaron, You're quite right - should have read Dario's old email in a bit more detail! Apologies, Ellery... I'm very curious to see the results of the see-also study, and it strikes me that we could also use this to get some idea of reading persistence - how much more likely is it that a (unique) link in the first half of the page is followed versus one in the second half. Andrew. On 29 April 2015 at 15:54, Aaron Halfaker wrote: > Indeed Andrew. Upon re-reading, I think you're right. Thanks for pointing > to that dataset. > > Also, primary credit for that dataset should go to Ellery Wulczyn. :) > Credit where it is due. > > Wulczyn, Ellery; Taraborelli, Dario (2015): Wikipedia Clickstream. figshare. > http://dx.doi.org/10.6084/m9.figshare.1305770 > > -Aaron > > > On Wed, Apr 29, 2015 at 9:47 AM, Andrew Gray > wrote: >> >> Hi Aaron, >> >> I may be misreading the request but I think what's being looked at >> here is Wikipedia -> Wikipedia links - so the referring server + the >> referred server are both ours. >> >> Given that, I *think* this data Dario put out earlier in the year >> would be what's needed - http://dx.doi.org/10.6084/m9.figshare.1305770 >> - but with the caveat that it's only enwiki and only for two months. >> It won't identify which link on a page was used (if it appears >> multiple times), but most "see also" links are unique within the page >> and so this shouldn't pose a problem. >> >> Andrew. >> >> >> On 29 April 2015 at 14:47, Aaron Halfaker >> wrote: >> > Hi Physikerwelt, >> > >> > I'm not sure how we'd collect that data. You'd need to gather it from >> > whatever server the user's browser made a request to after clicking one >> > of >> > those links. That's how referrers work. Also, clicks to non-https >> > links >> > from https Wikipedia will not contain referrers. See >> > https://meta.wikimedia.org/wiki/Research:Wikimedia_referrer_policy for a >> > proposal to update our policy. 
>> > >> > -Aaron >> > >> > On Wed, Apr 29, 2015 at 6:33 AM, Physikerwelt >> > wrote: >> >> >> >> Hi, >> >> >> >> is there information about referrals within enwiki? >> >> We are investigating the quality of the "See also" links and are >> >> looking >> >> for estimates how often the see also links were used. >> >> If so can we access the information from eqiad.wmflabs? >> >> >> >> Best >> >> Physikerwelt >> >> >> >> ___ >> >> Wiki-research-l mailing list >> >> Wiki-research-l@lists.wikimedia.org >> >> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l >> >> >> > >> > >> > _______ >> > Wiki-research-l mailing list >> > Wiki-research-l@lists.wikimedia.org >> > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l >> > >> >> >> >> -- >> - Andrew Gray >> andrew.g...@dunelm.org.uk >> >> ___ >> Wiki-research-l mailing list >> Wiki-research-l@lists.wikimedia.org >> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l > > > > ___ > Wiki-research-l mailing list > Wiki-research-l@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l > -- - Andrew Gray andrew.g...@dunelm.org.uk ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
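Tallying internal (article-to-article) referrals from the clickstream dataset mentioned above is a small aggregation job. A sketch with invented rows; the (prev, curr, type, n) column layout is an assumption here, so check the column description on the figshare record before relying on it:

```python
from collections import defaultdict

# Invented clickstream rows: (prev, curr, type, n).
# "link" rows are wikilinks followed between articles; "external"
# rows arrive from outside (search engines, direct visits, ...).
rows = [
    ("other-google", "Physics", "external", 12000),
    ("Physics", "Quantum_mechanics", "link", 340),
    ("Physics", "Isaac_Newton", "link", 210),
    ("other-empty", "Physics", "external", 5000),
]

followed = defaultdict(int)
for prev, curr, kind, n in rows:
    if kind == "link":  # keep only internal wikilink referrals
        followed[(prev, curr)] += n

for (src, dst), n in sorted(followed.items(), key=lambda kv: -kv[1]):
    print(f"{src} -> {dst}: {n} clicks")
```

The dataset only gives per-pair counts, so the reading-persistence question (first half of the page versus second half) would additionally require joining each pair against the source page's wikitext to find where the link sits.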
Re: [Wiki-research-l] Wikilink referral statistics
Hi Aaron, I may be misreading the request but I think what's being looked at here is Wikipedia -> Wikipedia links - so the referring server + the referred server are both ours. Given that, I *think* this data Dario put out earlier in the year would be what's needed - http://dx.doi.org/10.6084/m9.figshare.1305770 - but with the caveat that it's only enwiki and only for two months. It won't identify which link on a page was used (if it appears multiple times), but most "see also" links are unique within the page and so this shouldn't pose a problem. Andrew. On 29 April 2015 at 14:47, Aaron Halfaker wrote: > Hi Physikerwelt, > > I'm not sure how we'd collect that data. You'd need to gather it from > whatever server the user's browser made a request to after clicking one of > those links. That's how referrers work. Also, clicks to non-https links > from https Wikipedia will not contain referrers. See > https://meta.wikimedia.org/wiki/Research:Wikimedia_referrer_policy for a > proposal to update our policy. > > -Aaron > > On Wed, Apr 29, 2015 at 6:33 AM, Physikerwelt wrote: >> >> Hi, >> >> is there information about referrals within enwiki? >> We are investigating the quality of the "See also" links and are looking >> for estimates how often the see also links were used. >> If so can we access the information from eqiad.wmflabs? >> >> Best >> Physikerwelt >> >> ___ >> Wiki-research-l mailing list >> Wiki-research-l@lists.wikimedia.org >> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l >> > > > ___ > Wiki-research-l mailing list > Wiki-research-l@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l > -- - Andrew Gray andrew.g...@dunelm.org.uk ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] (no subject)
I've noted Finland (as a country) before when looking at Erik's data - IIRC, there's a vaguely normal-looking distribution of pages-per-internet-user-per-month for the Western European countries, and Finland is at the upper end but not a dramatic outlier, it's in a group with eg Sweden, Austria, etc. http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerCountryOverview.htm This pattern has been around since at least 2012: http://web.archive.org/web/20120922063053/http://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerCountryOverview.htm (not sure why the 2012 per-country numbers are so much higher...) Andrew. On 16 March 2015 at 09:30, Oliver Keyes wrote: > Awesome work! It's interesting to see Finnish as the outlier here. Do > we have any fi-users on the list who can comment on this and might > know what's going on? (And, in the absence of Finns: Jan, heard > anything from across the border? :p) > > The only caution I'd raise is that these numbers don't include spider > filtering. Why is this important? Well, a lot of traffic is driven by > crawlers and spiders and automata, particularly on smaller projects, > and it can lead to weirdness as a result. With the granular pagecount > files there's some work that can be done to detect this (for example, > using burst detection and a few heuristics around concentration > measures to eliminate pages that are clearly driven by automated > traffic - see the recent analytics mailing list thread) but only some. > I appreciate this is a flaw in the data we are releasing, not in your > work, which is an excellent read and highly interesting :). I agree > that understanding the lack of development in the PRC and ROK is > crucial - we keep talking about the "next billion readers" but only > talking :( > > On 16 March 2015 at 02:21, h wrote: >> Dear all, >> >> I have some findings to show the page views per Internet user >> measurement may help comparing different language editions of Wikipedia. 
>> Criticism and suggestions are welcome. >> >> - >> http://people.oii.ox.ac.uk/hanteng/2015/03/15/comparing-language-development-in-wikipedia-in-terms-of-page-views-per-internet-users/ >> >> Which language version of Wikipedia enjoys more page views per language >> Internet user than expected? It is Finnish. In terms of absolute positive >> and negative gap, English has the widest positive gap whereas Chinese has >> the largest negative gap. >> >> ... >> >> In particular, it is known that Wikipedia (and Google, which often favours >> Wikipedia) faces local competition in the People's Republic of China and >> South Korea. Therefore it is understandable that page views may be lower in >> Chinese and Korean Wikipedia language projects, simply because some users' >> need to read user-generated encyclopedias is satisfied by other websites. >> However, it remains an important question to examine why these particular >> Latin and Asian languages are under-developed for Wikipedia projects. > -- > Oliver Keyes > Research Analyst > Wikimedia Foundation -- - Andrew Gray andrew.g...@dunelm.org.uk ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
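The measurement being compared in this thread is a simple ratio. A toy sketch (all numbers invented placeholders, not real statistics, and per Oliver's caveat a real run would need spider/crawler filtering of the pagecounts first):

```python
# Page views per Internet user, per language community.
# Both inputs below are invented for illustration only.
data = {
    "Finnish": {"monthly_views": 120e6, "internet_users": 4.5e6},
    "Korean": {"monthly_views": 90e6, "internet_users": 45e6},
    "Swedish": {"monthly_views": 150e6, "internet_users": 8e6},
}

ratio = {lang: d["monthly_views"] / d["internet_users"]
         for lang, d in data.items()}
for lang, r in sorted(ratio.items(), key=lambda kv: -kv[1]):
    print(f"{lang}: {r:.1f} page views per Internet user per month")
```

Comparing these ratios against a fitted expectation (rather than raw view counts) is what lets a small language community like Finnish show up as over-performing.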
Re: [Wiki-research-l] preelminary results from the Wikipedia Gender Inequality Index project - comments welcome
the project, see >>>>> https://meta.wikimedia.org/wiki/Research:Wikipedia_Gender_Inequality_Index >>>>> and its talk page). We are very curious what you think (don't hesitate to >>>>> be critical). What we would really appreciate would be any alternative >>>>> hypotheses (to the one presented) that could try to explain why post-1950s >>>>> Confucian and South Asian clusters seem so much more inclusive of female >>>>> biographies than others (including the "Western" clusters). Are we seeing a >>>>> data error, or something else - and if so, what? >>>>> >>>>> -- >>>>> Piotr Konieczny, PhD >>>>> http://hanyang.academia.edu/PiotrKonieczny >>>>> http://scholar.google.com/citations?user=gdV8_AEJ >>>>> http://en.wikipedia.org/wiki/User:Piotrus -- - Andrew Gray andrew.g...@dunelm.org.uk ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] How many links did TWL account recipients add to Wikipedia with their access?
One approach might be to get a list of all pages/links added using Mark's method, then for each of the sets do a manual sample of a few percent of the new links and see who added them - this would let you know if you're looking at a situation where "almost all links to XYZ were added by TWL users" or "only some links were added by non-TWL users", and estimate accordingly. (I suspect some will definitely be in the first batch) Andrew. On 14 January 2015 at 13:32, mjn wrote: > > Aaron Halfaker writes: >> ...you'll need to parse wiki content in order to extract external links. >> I don't think they are stored in a table anywhere. > > The links themselves are, but it isn't tied to editor information, so I > don't think will answer this particular query. In the database dumps at > dumps.wikimedia.org, the table that's dumped as > xxwiki-mmdd-externallinks.sql.gz lists external links per-page. So > if you just wanted counts of link additions (or removals), you could > grab two dumps from different dates and compare. But you'll need to > parse the full revision information to get a count of who added which > links. > > -Mark > > -- > mjn | http://www.anadrome.org > > ___ > Wiki-research-l mailing list > Wiki-research-l@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l -- - Andrew Gray andrew.g...@dunelm.org.uk ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
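Mark's two-dump comparison reduces to a set difference once the (page, url) pairs have been pulled out of the externallinks table of each dated dump. A sketch with invented pairs; a real run would first parse the two SQL dumps:

```python
# (page_title, url) pairs from the externallinks table of two dumps.
# All pairs below are invented for illustration.
old_links = {
    ("Alkaloid", "http://example.org/paper1"),
    ("Alkaloid", "http://example.org/paper2"),
}
new_links = {
    ("Alkaloid", "http://example.org/paper2"),
    ("Alkaloid", "https://dx.doi.org/10.1000/example"),
    ("Quinine", "https://dx.doi.org/10.1000/another"),
}

added = new_links - old_links      # links present only in the later dump
removed = old_links - new_links    # links dropped between the dumps
print(f"{len(added)} added, {len(removed)} removed")
```

This gives counts per target domain but not *who* added each link, which is exactly why the manual sampling step suggested above is still needed to attribute additions to TWL users.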
Re: [Wiki-research-l] [Analytics] Geo-aggregation of Wikipedia page views: Maximizing geographic granularity while preserving privacy – a proposal
Fair enough - I don't use it, and I think I'd got entirely the wrong end of the stick on what it's for! If it's intended to stop tracking by third-party sites then it certainly seems to be of little relevance here. (It might be worth clarifying this in the proposal, in case a future ethics-committee reviewer gets the same misapprehension?) Andrew. On 13 January 2015 at 20:24, Aaron Halfaker wrote: > Andrew, > > I think it is reasonable to assume that the "Do not track" header isn't > referring to this. > > From http://donottrack.us/ with emphasis added. >> >> Do Not Track is a technology and policy proposal that enables users to opt >> out of tracking by websites they do not visit, [...] > > > Do not track is explicitly for third party tracking. We are merely > proposing to count those people who do access our sites. Note that, in this > case, we are not interested in obtaining identifiers at all, so the word > "track" seems to not apply. > > It seems like we're looking for something like a "Do Not Log Anything At > All" header. I don't believe that such a thing exists -- but if it did I > think it would be good if we supported it. > > -Aaron > > On Tue, Jan 13, 2015 at 2:03 PM, Andrew Gray > wrote: >> >> Hi Dario, Reid, >> >> This seems sensible enough and proposal #3 is clearly the better >> approach. An explicit opt-in opt-out mechanism would not be worth the >> effort to build and would become yet another ignored preferences >> setting after a few weeks... >> >> A couple of thoughts: >> >> * I understand the reasoning for not using do-not-track headers (#4); >> however, it feels a bit odd to say "they probably don't mean us" and >> skip them... I can almost guarantee you'll have at least one person >> making a vocal fuss about not being able to opt-out without an >> account. If we were to honour these headers, would it make a >> significant change to the amount of data available? Would it likely >> skew it any more than leaving off logged-in users? 
>> >> * Option 3 does releases one further piece of information over and >> above those listed - an approximate ratio of logged in versus >> non-logged-in pageviews for a page. I cannot see any particular >> problem with doing this (and I can think of a couple of fun things to >> use it for) but it's probably worth being aware. >> >> Andrew. >> >> On 13 January 2015 at 07:26, Dario Taraborelli >> wrote: >> > I’m sharing a proposal that Reid Priedhorsky and his collaborators at >> > Los Alamos National Laboratory recently submitted to the Wikimedia >> > Analytics >> > Team aimed at producing privacy-preserving geo-aggregates of Wikipedia >> > pageview data dumps and making them available to the public and the >> > research >> > community. [1] >> > >> > Reid and his team spearheaded the use of the public Wikipedia pageview >> > dumps to monitor and forecast the spread of influenza and other diseases, >> > using language as a proxy for location. This proposal describes an >> > aggregation strategy adding a geographical dimension to the existing dumps. 
>> > >> > Feedback on the proposal is welcome on the lists or the project talk >> > page on Meta [3] >> > >> > Dario >> > >> > [1] >> > https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pageviews >> > [2] http://dx.doi.org/10.1371/journal.pcbi.1003892 >> > [3] >> > https://meta.wikimedia.org/wiki/Research_talk:Geo-aggregation_of_Wikipedia_pageviews >> > ___ >> > Analytics mailing list >> > analyt...@lists.wikimedia.org >> > https://lists.wikimedia.org/mailman/listinfo/analytics >> >> >> >> -- >> - Andrew Gray >> andrew.g...@dunelm.org.uk >> >> ___ >> Wiki-research-l mailing list >> Wiki-research-l@lists.wikimedia.org >> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l > > > > ___ > Analytics mailing list > analyt...@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/analytics > -- - Andrew Gray andrew.g...@dunelm.org.uk ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
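Mechanically, the "Do Not Track" question in this thread is small: DNT is a single request header, so honouring it at aggregation time reduces to one check per request. A sketch only, with plain dicts standing in for whatever the real pipeline sees, and not WMF's actual logging code; the open question above is the semantic one (whether DNT's third-party-tracking intent applies to first-party counting at all):

```python
def countable(headers):
    """Exclude a request from geo-aggregation if it sends DNT: 1."""
    return headers.get("DNT") != "1"

# Invented request headers for illustration.
requests_seen = [{"DNT": "1"}, {}, {"DNT": "0"}]
counted = sum(1 for h in requests_seen if countable(h))
print(counted)  # -> 2
```

Counting how many requests such a filter would drop is also the cheapest way to answer Andrew's question about whether honouring the header would meaningfully skew the data.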
Re: [Wiki-research-l] [Analytics] Geo-aggregation of Wikipedia page views: Maximizing geographic granularity while preserving privacy – a proposal
Hi Dario, Reid, This seems sensible enough and proposal #3 is clearly the better approach. An explicit opt-in/opt-out mechanism would not be worth the effort to build and would become yet another ignored preferences setting after a few weeks... A couple of thoughts: * I understand the reasoning for not using do-not-track headers (#4); however, it feels a bit odd to say "they probably don't mean us" and skip them... I can almost guarantee you'll have at least one person making a vocal fuss about not being able to opt out without an account. If we were to honour these headers, would it make a significant change to the amount of data available? Would it likely skew it any more than leaving off logged-in users? * Option 3 does release one further piece of information over and above those listed - an approximate ratio of logged-in versus non-logged-in pageviews for a page. I cannot see any particular problem with doing this (and I can think of a couple of fun things to use it for) but it's probably worth being aware of. Andrew. On 13 January 2015 at 07:26, Dario Taraborelli wrote: > I’m sharing a proposal that Reid Priedhorsky and his collaborators at Los > Alamos National Laboratory recently submitted to the Wikimedia Analytics Team > aimed at producing privacy-preserving geo-aggregates of Wikipedia pageview > data dumps and making them available to the public and the research > community. [1] > > Reid and his team spearheaded the use of the public Wikipedia pageview dumps > to monitor and forecast the spread of influenza and other diseases, using > language as a proxy for location. This proposal describes an aggregation > strategy adding a geographical dimension to the existing dumps. 
> > Feedback on the proposal is welcome on the lists or the project talk page on > Meta [3] > > Dario > > [1] > https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pageviews > [2] http://dx.doi.org/10.1371/journal.pcbi.1003892 > [3] > https://meta.wikimedia.org/wiki/Research_talk:Geo-aggregation_of_Wikipedia_pageviews > ___ > Analytics mailing list > analyt...@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/analytics -- - Andrew Gray andrew.g...@dunelm.org.uk ___ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l