Regarding data, I have not been a part of these projects but I think that I can help a bit with working links:
* The (I believe) original dataset can also be found here: https://analytics.wikimedia.org/datasets/archive/public-datasets/all/mwrefs/
* A newer version of this dataset was produced that also included information about whether the source was openly available and its topic:
** Meta page: https://meta.wikimedia.org/wiki/Research:Towards_Modeling_Citation_Quality
** Figshare: https://figshare.com/articles/Accessibility_and_topics_of_citations_with_identifiers_in_Wikipedia/6819710
On Mon, Aug 26, 2019 at 3:53 AM Federico Leva (Nemo) <nemow...@gmail.com> wrote:
> Greg, 22/08/19 06:19:
> > I do not know the current status of wikicite or if/when this
> > could be used for this inquiry--either to examine all, or a sensible subset
> > of the citations.
>
> If I see correctly, you still did not receive an answer on the data available.
>
> It's true that the Figshare item for
> <https://meta.wikimedia.org/wiki/Research:Scholarly_article_citations_in_Wikipedia>
> was deleted (I've asked about it on the talk page), but it's trivial to
> run https://pypi.org/project/mwcites/ and extract the data yourself, at
> least for citations which use an identifier.
>
> Some example datasets produced this way:
> https://zenodo.org/record/15871
> https://zenodo.org/record/55004
> https://zenodo.org/record/54799
>
> Once you extract the list of works, the fun begins. You'll need to
> intersect it with other data sources (Wikidata, ORCID, others?) and account
> for a number of factors until you manage to find a subset of the data
> which has a sufficiently high signal-to-noise ratio.
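[For readers unfamiliar with mwcites: it scans wikitext for scholarly identifiers (DOIs, PMIDs, arXiv IDs, ISBNs). A minimal regex-based sketch of the same idea, not mwcites' actual API or patterns, could look like this:]

```python
import re

# Rough patterns for common scholarly identifiers found in wikitext.
# These are illustrative approximations, NOT the exact patterns mwcites uses.
PATTERNS = {
    "doi": re.compile(r"\b10\.\d{4,9}/[^\s|}<>\"]+"),
    "pmid": re.compile(r"\bpmid\s*=\s*(\d+)", re.IGNORECASE),
    "arxiv": re.compile(r"\barxiv\s*=\s*(\d{4}\.\d{4,5})", re.IGNORECASE),
}

def extract_identifiers(wikitext):
    """Yield (type, id) pairs for every identifier matched in the text."""
    for id_type, pattern in PATTERNS.items():
        for match in pattern.finditer(wikitext):
            # Use the capture group when the pattern has one, else the whole match.
            yield id_type, match.group(1) if pattern.groups else match.group(0)

text = "{{cite journal |doi=10.1371/journal.pone.0038869 |pmid=22719970}}"
print(sorted(extract_identifiers(text)))
```

[Running this over every revision of an XML dump, as mwcites does, is what yields datasets like the Zenodo records above.]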
> For instance you might need to filter or normalise by:
> * year of publication (some year recent enough to have good data but old
>   enough to allow the work to be cited elsewhere and be archived after
>   embargoes);
> * country or institution (some probably have better ORCID coverage);
> * field/discipline and language;
> * open access status (per Unpaywall);
> * number of expected pageviews and clicks (for instance using
>   <https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews> and
>   <https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream#Releases>;
>   a link from 10k articles on asteroids or proteins is not the same as
>   being the lone link from a popular article, which is not the same as a
>   link buried among a thousand others in a big article);
> * time or duration of the addition (with one of the various diff
>   extraction libraries, content persistence data, or possibly a historical
>   event stream if such a thing is available).
>
> To avoid having to invent everything yourself, maybe you can reuse the
> method of some similar study, for instance the one on the open access
> citation advantage, or one of the many which studied the gender imbalance
> of citations and peer review in journals.
>
> However, it's very possible that the noise is just too much for a
> general computational method. You might consider a more manual approach
> on a sample of relevant events, for instance the *removal* of citations,
> which is in my opinion more significant than the addition.* You might
> extract all the diffs which removed a citation from an article in the
> last N years (probably they'll be on the order of 10^5 rather than
> 10^6), remove some massive events or outliers, sample 500-1000 of them
> randomly, and verify the required data manually.
>
> As usual it will be impossible to have an objective assessment of
> whether that citation was really (in)appropriate in that context
> according to the (English or whatever) Wikipedia guidelines.
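[The filter-then-sample pipeline Federico sketches could look roughly like the following. The AQS per-article pageviews endpoint is the real one documented on Wikitech; the record fields, year thresholds, and toy data are illustrative assumptions, not from the email:]

```python
import random
from urllib.parse import quote

# Toy citation-removal events; in practice these come from diff extraction
# over the revision history, as described above.
removals = [
    {"article": "Douglas Adams", "year": 2012, "open_access": True},
    {"article": "Asteroid", "year": 2018, "open_access": False},
    {"article": "Protein", "year": 2015, "open_access": True},
]

def keep(event, min_year=2010, max_year=2016):
    # Keep a publication-year window old enough for the work to accumulate
    # citations but recent enough for good metadata coverage (thresholds
    # are illustrative).
    return min_year <= event["year"] <= max_year

def pageviews_url(article, start="20190101", end="20190131"):
    # Wikimedia AQS per-article pageviews endpoint; see
    # https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews
    title = quote(article.replace(" ", "_"), safe="")
    return ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
            f"en.wikipedia/all-access/all-agents/{title}/daily/{start}/{end}")

filtered = [e for e in removals if keep(e)]
# Sample a manually verifiable subset (500-1000 in the email; 2 for toy data).
random.seed(0)
sample = random.sample(filtered, k=min(2, len(filtered)))
for event in sample:
    print(event["article"], pageviews_url(event["article"]))
```

[Normalising by pageviews would then mean fetching each URL and dividing, or bucketing, the removal counts by the traffic of the affected article.]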
> To test that too, you should replicate one of the various studies of the gender
> imbalance of peer review, perhaps one of those which tried to assess the
> impact of a double-blind peer review system on the gender imbalance.
> However, because the sources are already published, you'd need to
> provide the de-gendered information yourself and make sure the
> participants perform their assessment in some controlled environment
> where they don't have access to any gendered information (i.e. where you
> cut them off from the internet).
>
> How many years do you have to work on this project? :-)
>
> Federico
>
> (*) I might add a citation just because it's the first result a popular
> search engine gives me, after glancing at the abstract and maybe the
> journal home page; but if I remove an existing citation, hopefully I've
> at least assessed its content and made a judgement about it, apart from
> cases of mass removals for specific problems with certain articles or
> publication venues.
>
> _______________________________________________
> Wiki-research-l mailing list
> Wiki-research-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

--
Isaac Johnson -- Research Scientist -- Wikimedia Foundation