Regarding data: I have not been part of these projects, but I can help a
bit with some working links:
* The (I believe) original dataset can also be found here:
https://analytics.wikimedia.org/datasets/archive/public-datasets/all/mwrefs/
* A newer version of this dataset was later produced that also includes
information about each source's topic and whether it was openly available:
** Meta page:
https://meta.wikimedia.org/wiki/Research:Towards_Modeling_Citation_Quality
** Figshare:
https://figshare.com/articles/Accessibility_and_topics_of_citations_with_identifiers_in_Wikipedia/6819710

On Mon, Aug 26, 2019 at 3:53 AM Federico Leva (Nemo) <nemow...@gmail.com>
wrote:

> Greg, 22/08/19 06:19:
> > I do not know the current status of wikicite or if/when this
> > could be used for this inquiry--either to examine all, or a sensible
> subset
> > of the citations.
>
> If I see correctly, you still have not received an answer about the
> available data.
>
> It's true that the Figshare item for
> <https://meta.wikimedia.org/wiki/Research:Scholarly_article_citations_in_Wikipedia>
> was deleted (I've asked about it on the talk page), but it's trivial to
> run https://pypi.org/project/mwcites/ and extract the data yourself, at
> least for citations which use an identifier.
>
> Some example datasets produced this way:
> https://zenodo.org/record/15871
> https://zenodo.org/record/55004
> https://zenodo.org/record/54799
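For a sense of what such an extraction does, the identifier scan that mwcites automates can be roughly approximated with plain regexes. This is only an illustrative stand-in, not mwcites' actual (more careful) extractors, and the sample wikitext is invented:

```python
import re

# Rough identifier patterns; mwcites uses more robust extractors,
# so treat these as an approximation for illustration only.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s|}<>"]+')
ARXIV_RE = re.compile(r'\barxiv\s*[:=]\s*(\d{4}\.\d{4,5}(?:v\d+)?)', re.IGNORECASE)

def extract_identifiers(wikitext):
    """Return (type, id) pairs for DOIs and arXiv IDs found in raw wikitext."""
    ids = [("doi", m.group(0).rstrip(".,;")) for m in DOI_RE.finditer(wikitext)]
    ids += [("arxiv", m.group(1)) for m in ARXIV_RE.finditer(wikitext)]
    return ids

sample = '{{cite journal |doi=10.1371/journal.pone.0127502 |arxiv=1407.5238}}'
print(extract_identifiers(sample))
# → [('doi', '10.1371/journal.pone.0127502'), ('arxiv', '1407.5238')]
```

Running something like this over a full dump (which is what mwcites handles, revision by revision) yields the kind of per-identifier dataset shown in the Zenodo records above.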
>
> Once you extract the list of works, the fun begins. You'll need to
> intersect with other data sources (Wikidata, ORCID, other?) and account
> for a number of factors until you manage to find a subset of the data
> which has a sufficiently high signal:noise ratio. For instance you might
> need to filter or normalise by
> * year of publication (recent enough to have good data, but old enough
> for the work to be cited elsewhere and to be archived after embargoes);
> * country or institution (some probably have better ORCID coverage);
> * field/discipline and language;
> * open access status (per Unpaywall);
> * number of expected pageviews and clicks (for instance using
> <https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews> and
> <https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream#Releases>;
>
> a link from 10k articles on asteroids or proteins is not the same as
> being the lone link from a popular article which is not the same as a
> link buried among a thousand others on a big article);
> * time or duration of the addition (with one of the various diff
> extraction libraries, content persistence data or possibly historical
> eventstream if such a thing is available).
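The filtering step described above might be sketched as follows. The record fields (year, is_oa, pageviews) and thresholds are hypothetical; in practice they would come from joins against Wikidata, Unpaywall, and the pageview data, and the cutoffs would need tuning:

```python
# Sketch of narrowing an extracted citation list to a higher-signal subset,
# per the factors listed above. All field names and thresholds are
# illustrative assumptions, not part of any published method.

def high_signal_subset(citations, year_min=2005, year_max=2015, min_views=100):
    return [
        c for c in citations
        if year_min <= c["year"] <= year_max   # old enough to be cited, recent enough for good metadata
        and c["is_oa"] is not None             # OA status resolvable (e.g. via Unpaywall)
        and c["pageviews"] >= min_views        # enough exposure for the link to matter
    ]

rows = [
    {"year": 2010, "is_oa": True,  "pageviews": 5000},
    {"year": 2019, "is_oa": True,  "pageviews": 9000},   # too recent
    {"year": 2012, "is_oa": None,  "pageviews": 300},    # OA status unknown
    {"year": 2008, "is_oa": False, "pageviews": 40},     # too few views
]
print(len(high_signal_subset(rows)))  # → 1
```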
>
> To avoid having to invent everything yourself, maybe you can reuse the
> method of some similar study, for instance the one on the open access
> citation advantage or one of the many which studied the gender imbalance
> of citations and peer review in journals.
>
> However, it's very possible that the noise is just too much for a
> general computational method. You might consider a more manual approach
> on a sample of relevant events, for instance the *removal* of citations,
> which is in my opinion more significant than the addition.* You might
> extract all the diffs which removed a citation from an article in the
> last N years (probably on the order of 10^5 rather than
> 10^6), remove some massive events or outliers, sample 500-1000 of them
> randomly and verify the required data manually.
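The sampling step could look something like this. It is a minimal sketch under assumed field names: drop mass-removal events (one editor removing many citations at once), then draw a fixed random sample for manual review:

```python
import random
from collections import Counter

# Sketch of the manual-sampling approach above: given one record per
# citation-removing diff, filter out mass removals and sample the rest.
# The field names and the max_per_editor cutoff are illustrative assumptions.

def sample_removals(events, max_per_editor=50, n=1000, seed=42):
    per_editor = Counter(e["editor"] for e in events)
    filtered = [e for e in events if per_editor[e["editor"]] <= max_per_editor]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(filtered, min(n, len(filtered)))

# Toy data: one prolific "bot" account plus many one-off editors.
events = [{"rev_id": i, "editor": "bot" if i % 2 else f"user{i}"} for i in range(200)]
picked = sample_removals(events, max_per_editor=50, n=30)
print(len(picked))  # → 30, none of them by "bot"
```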
>
> As usual it will be impossible to have an objective assessment of
> whether that citation was really (in)appropriate in that context
> according to the (English or whatever) Wikipedia guidelines. To test
> that too, you should replicate one of the various studies of the gender
> imbalance of peer review, perhaps one of those which tried to assess the
> impact of a double blind peer review system on the gender imbalance.
> However, because the sources are already published, you'd need to
> provide gender-stripped versions of the information yourself and make
> sure the participants perform their assessment in a controlled
> environment with no access to any gendered information (i.e. where you
> cut them off from the internet).
>
> How many years do you have to work on this project? :-)
>
> Federico
>
> (*) I might add a citation just because it's the first result a popular
> search engine gives me, after glancing at the abstract and maybe the
> journal home page; but if I remove an existing citation, hopefully I've
> at least assessed its content and made a judgement about it, apart from
> cases of mass removals for specific problems with certain articles or
> publication venues.
>
> _______________________________________________
> Wiki-research-l mailing list
> Wiki-research-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>


-- 
Isaac Johnson -- Research Scientist -- Wikimedia Foundation