Hi everyone :)

Forwarding this message from Denny, as it is very relevant for our work on
lexicographical data in Wikidata and for helping us understand where the gaps
are and how best to fill them.

Cheers
Lydia


---------- Forwarded message ---------
From: Denny Vrandečić <dvrande...@wikimedia.org>
Date: Wed, Feb 10, 2021 at 11:38 PM
Subject: [Abstract-wikipedia] Newsletter #18: Two prototype tools to
visualize lexicographic coverage in Wikidata
To: Abstract Wikipedia list <abstract-wikipe...@lists.wikimedia.org>


The on-wiki version of this newsletter is here:
https://meta.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2021-02-10

The goal of Abstract Wikipedia is to generate natural language text from an
abstract representation of the content. In order to do so, we will use
lexicographic data from Wikidata. And although we are quite far from being
able to generate texts, one thing we already want to encourage everyone’s
help with is improving the coverage and completeness of the lexicographic
data in Wikidata.

Today we want to present prototypes of two tools that could help people
visualize and exemplify the coverage of lexicographic data in Wikidata, and
better guide our understanding of it.

Annotation interface

The first prototype is an annotation interface that allows users to
annotate sentences in any language, associating each word or expression
with a Lexeme from Wikidata, including picking its Form and Sense.

You can see an example in the screenshot below. Each ‘word’ of the sentence
here is annotated with a Lexeme (the Lexeme ID L31818
<https://www.wikidata.org/wiki/Lexeme:L31818> is given just under the
word), followed by the lemma, the language, and the part of speech. Then
comes, if selected, the specific Form that is being used in context - for
example, on ‘dignity’ we see the Form ID L31818#F1, which is the singular
Form of the Lexeme. Lastly comes the Sense, which has the Sense ID
L31818#S1 and is defined by a gloss.
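
For context, here is a minimal sketch (in Python, using the standard
Wikidata API action wbgetentities) of how the Lexeme, Form, and Sense data
behind such an annotation can be retrieved. This is not necessarily how the
prototype itself is implemented; the field names follow the current Lexeme
JSON format, and the API documentation is authoritative if anything differs:

  # Sketch: fetch Lexeme L31818 ("dignity") and list its Forms and Senses
  # via the standard Wikidata API.
  import requests

  API = "https://www.wikidata.org/w/api.php"

  resp = requests.get(API, params={
      "action": "wbgetentities",
      "ids": "L31818",
      "format": "json",
  })
  lexeme = resp.json()["entities"]["L31818"]

  # Lemma, language, and lexical category (the latter two are Q-item IDs).
  print(lexeme["lemmas"]["en"]["value"],
        lexeme["language"], lexeme["lexicalCategory"])

  # Forms (shown as L31818#F1 etc. in the interface), with their spellings
  # and grammatical features.
  for form in lexeme["forms"]:
      print(form["id"], form["representations"]["en"]["value"],
            form["grammaticalFeatures"])

  # Senses (L31818#S1 etc.), each defined by a gloss.
  for sense in lexeme["senses"]:
      print(sense["id"], sense["glosses"].get("en", {}).get("value"))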

At any time you can remove any of the annotations or add new ones. Some of
the options will take you directly to Wikidata. For example, if you want to
add a Sense to a given Lexeme, because it has no Senses yet or is missing
the one you need, the tool will take you to Wikidata and let you add it
there in the normal fashion. Once added there, you can come back and select
the newly added Sense.

The user interface of the prototype is a bit slow, so please give it a few
seconds when you initiate an action. It should work out of the box in
different languages. The Universal Language Selector is available (at the
top of the page), which you can use to change the language. Note that
glosses of Senses are frequently only available in the language of the
Lexeme, and the UI doesn’t yet do language fallback, so if you look at
English sentences with a German UI you might often find missing glosses.

Technologically, this is a prototype entirely implemented in JavaScript and
CSS on top of a vanilla MediaWiki installation. This is likely not the best
possible technical solution for such a system, but it should help determine
whether there is user interest in the tool before a potential
reimplementation. Also, it would be a fascinating task to agree on an API
which can be implemented by other groups to provide the selection of
Lexemes, Senses, and Forms for input sentences. The current baseline here
is extremely simple, and would not be good enough for an automated tagging
system. Having this available for many sentences in many languages could
provide a great corpus for training natural language understanding systems.
There is a lot that could be built upon that.
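
As an illustration only (none of this is a defined interface), such an API
could take a sentence and a language and return candidate Lexeme, Form, and
Sense IDs per token. The sketch below is a trivial baseline of that kind: it
matches each token against a dictionary of known Form spellings, and all
names and data structures in it are hypothetical:

  # Hypothetical sketch of an annotation API of the kind discussed above:
  # given a sentence and a language, propose Lexeme/Form candidates per
  # token and leave the Sense for the user to pick.
  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class Annotation:
      token: str
      lexeme_id: Optional[str]   # e.g. "L31818"
      form_id: Optional[str]     # e.g. "L31818#F1"
      sense_id: Optional[str]    # e.g. "L31818#S1"

  def annotate(sentence, language, forms_index):
      """forms_index maps a surface form to (lexeme_id, form_id).
      The language is unused in this naive baseline."""
      annotations = []
      for token in sentence.split():
          word = token.strip(".,;:!?").lower()
          lexeme_id, form_id = forms_index.get(word, (None, None))
          annotations.append(Annotation(token, lexeme_id, form_id, None))
      return annotations

  # Example with a tiny hand-made index for one word.
  index = {"dignity": ("L31818", "L31818#F1")}
  for a in annotate("Human dignity shall be inviolable.", "en", index):
      print(a)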

The goal of this prototype is to make the Wikidata community's progress on
the coverage of lexicographic data more tangible. You can take a sentence in
any written language, put it into this system, and find out how complete you
can get with your annotations. It's a way to showcase the lexicographic data
in Wikidata and to build anecdotal experience with it.

The prototype annotation interface is at:

http://annotation.wmcloud.org/

You can discuss it here:

https://annotation.wmcloud.org/wiki/Discussion
(You will need to create a new account - if you have time to set this up
with SUL, drop me a line)

Corpus coverage dashboard

The second prototype tool is a dashboard that shows how well the
lexicographic data covers a corpus in each of forty languages.

Last year, whilst in my previous position at Google Research, I co-authored
a publication where we built and published language models out of the
cleaned-up text of about forty Wikipedia language editions [1]. Besides the
language models, we also published the raw data: this text has been cleaned
up by the pre-processing system that Google uses on Wikipedia text in order
to integrate the text into several of its features. So while this dataset
consists of relatively clean natural language text (certainly compared to
the raw wiki text), it still contains plenty of artefacts. If you know of
better large scale encyclopedic text corpora we can use, maybe better
cleaned-up versions of Wikipedia, or ones covering more languages, please let
us know <https://phabricator.wikimedia.org/T273221>.

We extracted these texts from the TensorFlow models
<https://www.tensorflow.org/datasets/catalog/wiki40b>. We provide the extracted
texts for download
<https://drive.google.com/drive/folders/1HfL138UCqr69w0XfAhlAEUh6VVOnzwBE>
(a task <https://phabricator.wikimedia.org/T274208> to move it to Wikimedia
servers is underway). We split the text into tokens, counted the
occurrences of words, and compared how many of these tokens appear among the
Forms of Lexemes of the given language in Wikidata's lexicographic data. If
this proves useful, we could move the cleaned-up text to a more permanent
home.
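
To make the counting step concrete, here is a rough sketch in Python. It
assumes simple whitespace tokenization and a plain-text corpus file (the
file name is hypothetical; the text can come from the download above or from
the wiki40b dataset in TensorFlow Datasets), and the actual pipeline may
tokenize differently:

  # Rough sketch of the counting step: split the corpus text into tokens
  # and count how often each distinct word (form) occurs. Assumes a
  # plain-text file with one document per line.
  from collections import Counter

  def count_tokens(corpus_path):
      counts = Counter()
      with open(corpus_path, encoding="utf-8") as corpus:
          for line in corpus:
              counts.update(line.split())
      return counts

  token_counts = count_tokens("enwiki_cleaned.txt")  # hypothetical file name
  print("distinct forms in the corpus:", len(token_counts))
  print("total tokens in the corpus:", sum(token_counts.values()))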

A screenshot of the current state for English is given here: we see how
many Forms for this language are available in Wikidata, and we see how many
different Forms are attested in Wikipedia (i.e., how many different words,
or word types, are in the Wikipedia of the given language). The number of
tokens is the total number of words in the given language corpus. Covered
forms says how many of the forms in the corpus also appear on a Lexeme in
Wikidata, and covered tokens tells us how many token occurrences those forms
account for (so, if the word ‘time’ appears 100 times in English Wikipedia,
it would be counted as one covered form, but 100 covered tokens). The two pie
charts visualize the coverage of forms and tokens respectively.
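
Continuing the sketch above, the coverage numbers can be computed roughly
like this (the data here is a toy example, and the set of Form spellings per
language would in practice come from a Wikidata dump or query):

  # Sketch of the dashboard's coverage numbers. If 'time' occurs 100 times
  # in the corpus and is in Wikidata, it contributes 1 covered form but
  # 100 covered tokens.
  from collections import Counter

  def coverage(token_counts, wikidata_forms):
      corpus_forms = set(token_counts)
      covered_forms = corpus_forms & set(wikidata_forms)
      covered_tokens = sum(token_counts[f] for f in covered_forms)
      total_tokens = sum(token_counts.values())
      return {
          "forms in corpus": len(corpus_forms),
          "covered forms": len(covered_forms),
          "form coverage": len(covered_forms) / len(corpus_forms),
          "covered tokens": covered_tokens,
          "token coverage": covered_tokens / total_tokens,
      }

  # Toy data, for illustration only.
  token_counts = Counter({"time": 100, "the": 500, "zeitgeisty": 1})
  wikidata_forms = {"time", "the"}
  print(coverage(token_counts, wikidata_forms))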

Finally, there is a link to the thousand most frequent forms that are not
yet in Wikidata. This can help communities prioritise ramping up coverage
quickly. Note, though, that the report is generated manually and does not
update automatically; for now, I plan to run an update from time to time.
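
Such a list of frequent but still-missing forms could be produced from toy
data like the example above, for instance:

  # Sketch: the most frequent corpus forms with no matching Form in
  # Wikidata yet (toy data, for illustration only).
  from collections import Counter

  token_counts = Counter({"time": 100, "the": 500, "zeitgeisty": 1})
  wikidata_forms = {"time", "the"}

  missing = [form for form, _ in token_counts.most_common()
             if form not in wikidata_forms][:1000]
  print(missing)  # ['zeitgeisty']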

The prototype corpus coverage dashboard is at:
https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage

You can discuss it here:

https://www.wikidata.org/wiki/Wikidata_talk:Lexicographical_coverage


Help wanted

Both prototype tools are exactly that: prototypes, not real products. We
have not committed to supporting or developing these prototypes further. At
the same time, all of the code and data is of course open source. If you
would like to pick up the development or maintenance of these prototypes,
you would be more than welcome – please let us know (on my talk
page <https://meta.wikimedia.org/wiki/User_talk:DVrandecic_(WMF)>, or via
e-mail, or on the Tool ideas page
<https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Ideas_of_tools>
).

Also, if someone likes the idea but thinks that a different implementation
would be better, please move ahead with that – I am happy to support and
talk with you. There is much to improve here, but we hope that these two
prototypes will lead to more development of content and tools
<https://www.wikidata.org/wiki/Wikidata:Tools/Lexicographical_data> in the
space of lexicographic data.

[1] Mandy Guo, Zihang Dai, Denny Vrandečić, Rami Al-Rfou: Wiki-40B:
Multilingual Language Model Dataset, LREC 2020,
https://www.aclweb.org/anthology/2020.lrec-1.297/
_______________________________________________
Abstract-Wikipedia mailing list
abstract-wikipe...@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia


-- 
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 Nz. Recognized as a charitable
organization by the Finanzamt für Körperschaften I Berlin, tax number
27/029/42207.
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
