[Wiki-research-l] Machine-utilizable Crowdsourced Lexicons

Adam Sobieski Tue, 29 May 2018 18:02:05 -0700

INTRODUCTION

Machine-utilizable lexicons can enhance a great number of speech and natural 
language technologies. Scientists, engineers and technologists – linguists, 
computational linguists and artificial intelligence researchers – eagerly await 
the advancement of machine lexicons which include rich, structured metadata and 
machine-utilizable definitions.


Wiktionary, a collaborative project to produce a free-content multilingual 
dictionary, aims to describe all words of all languages using definitions and 
descriptions. The Wiktionary project, brought online in 2002, includes 139 
spoken languages and American Sign Language [1].

This letter hopes to inspire exploration into and discussion regarding machine 
wiktionaries, machine-utilizable crowdsourced lexicons, and services which 
could exist at https://machine.wiktionary.org/ .

LEXICON EDITIONING

The premise of editioning is that one version of the resource can be more or 
less frozen, e.g. a 2018 edition, while wiki editors collaboratively work on a 
next version, e.g. a 2019 edition. Editioning can provide stability for complex 
software engineering scenarios utilizing an online resource. Some software 
engineering teams, however, may prefer to work from up-to-date dumps or data 
exports of the most recent edition.

SEMANTIC WEB

A machine-utilizable lexicon could include a semantic model of its contents and 
a SPARQL endpoint.
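
As a rough illustration of what querying such an endpoint might look like, the sketch below builds a SPARQL query and the corresponding GET request URL. The endpoint path /sparql is hypothetical, and the use of the W3C OntoLex-Lemon vocabulary as the semantic model is an assumption made for this example; this letter does not prescribe either.

```python
from urllib.parse import urlencode

# Hypothetical endpoint path; only the machine.wiktionary.org host is proposed in the letter.
ENDPOINT = "https://machine.wiktionary.org/sparql"


def build_sense_query(lemma: str, lang: str) -> str:
    """Return a SPARQL query listing the senses of a lemma.

    Assumes the lexicon is modelled with OntoLex-Lemon terms.
    The lemma is inserted without escaping, which is acceptable only in a sketch.
    """
    return f"""
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
SELECT ?entry ?sense WHERE {{
  ?entry a ontolex:LexicalEntry ;
         ontolex:canonicalForm ?form ;
         ontolex:sense ?sense .
  ?form ontolex:writtenRep "{lemma}"@{lang} .
}}
""".strip()


def build_request_url(query: str) -> str:
    """Encode the query in the 'query' parameter of a GET request."""
    return ENDPOINT + "?" + urlencode({"query": query})


query = build_sense_query("fly", "en")
url = build_request_url(query)
```

No request is sent here; the sketch only shows the shape of a query that a client could submit.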

MACHINE-UTILIZABLE DEFINITIONS

Machine-utilizable definitions, available in a number of knowledge 
representation formats, can be granular, detailed and nuanced.

There exist a large number of use cases for machine-utilizable definitions. One 
use case is providing natural language processing components with the 
capability to semantically interpret natural language and to utilize automated 
reasoning to disambiguate lexemes, phrases and sentences in context. Some 
contend that the best output after a natural language processing component 
processes a portion of natural language is each possible interpretation, 
perhaps weighted via statistics. In this way, (1) natural language processing 
components could process ambiguous language, (2) other components, e.g. 
automated reasoning components, could narrow sets of hypotheses utilizing 
dialogue contexts, (3) other components, e.g. automated reasoning components, 
could narrow sets of hypotheses utilizing knowledgebase content, and (4) 
mixed-initiative dialogue systems could also ask users questions to narrow sets 
of hypotheses. Such disambiguation and interpretation would utilize 
machine-utilizable definitions of senses of lexemes.
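
The four steps above can be sketched as a small pipeline. This is a minimal illustration under assumptions: the Hypothesis type, the narrow function, the sense numbers, the weights and the two filter functions are all hypothetical stand-ins, with simple predicates taking the place of real automated reasoning components.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Hypothesis:
    lemma: str
    sense: int      # sense number within the lexicon entry
    weight: float   # statistical weight assigned by the NLP component


Filter = Callable[[Hypothesis], bool]


def narrow(hypotheses: Iterable[Hypothesis], filters: Iterable[Filter]) -> list[Hypothesis]:
    """Keep hypotheses accepted by every filter, then renormalize the weights."""
    filters = list(filters)
    kept = [h for h in hypotheses if all(f(h) for f in filters)]
    total = sum(h.weight for h in kept)
    if total == 0:
        return kept
    return [Hypothesis(h.lemma, h.sense, h.weight / total) for h in kept]


# Step (1): an NLP component emits every interpretation with a weight.
# The sense numbers and weights are illustrative only.
candidates = [
    Hypothesis("fly", 1, 0.5),
    Hypothesis("fly", 2, 0.3),
    Hypothesis("fly", 4, 0.2),
]


def dialogue_filter(h: Hypothesis) -> bool:
    """Step (2): stand-in for reasoning over the dialogue context."""
    return h.sense != 2


def knowledge_filter(h: Hypothesis) -> bool:
    """Step (3): stand-in for reasoning over knowledgebase content."""
    return h.sense != 1


remaining = narrow(candidates, [dialogue_filter, knowledge_filter])
```

If more than one hypothesis remained after narrowing, step (4) would apply: a mixed-initiative dialogue system could ask the user which sense was meant.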

CONJUGATION, DECLENSION AND THE URL-BASED SPECIFICATION OF LEXEMES AND LEXICAL 
PHRASES

A grammatical category [2] is a property of items within the grammar of a 
language; it has a number of possible values, sometimes called grammemes, which 
are normally mutually exclusive within a given category. Verb conjugation, for 
example, may be affected by the grammatical categories of: person, number, 
gender, tense, aspect, mood, voice, case, possession, definiteness, politeness, 
causativity, clusivity, interrogativity, transitivity, valency, polarity, 
telicity, volition, mirativity, evidentiality, animacy, associativity, 
pluractionality, reciprocity, agreement, polypersonal agreement, incorporation, 
noun class, noun classifiers, and verb classifiers in some languages [3].

By combining the grammatical categories from each and every language together, 
we can precisely specify a conjugation or declension. For example, the URL:

https://machine.wiktionary.org/wiki/lookup.php?edition=2018&language=en-US&lemma=fly&category=verb&person=first-person&number=singular&tense=past&aspect=past_simple&mood=indicative&…

includes an edition, a language of a lemma, a lemma, a lexical category, and 
conjugates (with ellipses) the verb in a language-independent manner.

We can further specify, via URL query string, the semantic sense of a 
grammatical element:

https://machine.wiktionary.org/wiki/lookup.php?edition=2018&language=en-US&lemma=fly&category=verb&person=first-person&number=singular&tense=past&aspect=past_simple&mood=indicative&...&sense=4
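
A client could assemble such URLs programmatically. The sketch below is one possible way to do so; the function name lookup_url and its signature are hypothetical, and it passes only the five grammatical categories written out in the examples, leaving the elided parameters unspecified.

```python
from urllib.parse import urlencode

BASE = "https://machine.wiktionary.org/wiki/lookup.php"


def lookup_url(edition, language, lemma, category, features, sense=None):
    """Build a lookup URL from an edition, a lemma and grammatical-category values."""
    params = {
        "edition": edition,
        "language": language,
        "lemma": lemma,
        "category": category,
    }
    params.update(features)      # e.g. person, number, tense, aspect, mood
    if sense is not None:
        params["sense"] = sense  # optional semantic sense
    return BASE + "?" + urlencode(params)


features = {
    "person": "first-person",
    "number": "singular",
    "tense": "past",
    "aspect": "past_simple",
    "mood": "indicative",
}

form_url = lookup_url(2018, "en-US", "fly", "verb", features)
sense_url = lookup_url(2018, "en-US", "fly", "verb", features, sense=4)
```

Because the grammatical categories are passed as a plain mapping, categories from any language can be added without changing the function.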

Specifying a grammatical item fully in a URL query string, as indicated in the 
previous examples, could result in a redirection to another URL.

That is, the URL:

https://machine.wiktionary.org/wiki/lookup.php?edition=2018&language=en-US&lemma=fly&category=verb&person=first-person&number=singular&tense=past&aspect=past_simple&mood=indicative&…

could redirect to:

https://machine.wiktionary.org/wiki/index.php?edition=2018&id=12345678

or to:

https://machine.wiktionary.org/wiki/2018/12345678/

and the URL with a specified semantic sense:

https://machine.wiktionary.org/wiki/lookup.php?edition=2018&language=en-US&lemma=fly&category=verb&person=first-person&number=singular&tense=past&aspect=past_simple&mood=indicative&...&sense=4

could redirect to:

https://machine.wiktionary.org/wiki/index.php?edition=2018&id=12345678&sense=4

or to:

https://machine.wiktionary.org/wiki/2018/12345678/4/
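
The two redirect targets for a given ID can be derived mechanically. The following sketch, with the hypothetical function name canonical_urls, produces both the query-string style and the path style shown above; how a server would choose between them is left open.

```python
ROOT = "https://machine.wiktionary.org/wiki"


def canonical_urls(edition, form_id, sense=None):
    """Return (query-style URL, path-style URL) for a form ID and optional sense."""
    query_style = f"{ROOT}/index.php?edition={edition}&id={form_id}"
    path_style = f"{ROOT}/{edition}/{form_id}/"
    if sense is not None:
        query_style += f"&sense={sense}"
        path_style += f"{sense}/"
    return query_style, path_style


form_targets = canonical_urls(2018, 12345678)
sense_targets = canonical_urls(2018, 12345678, sense=4)
```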

The URL https://machine.wiktionary.org/wiki/2018/12345678/ is intended to 
indicate a conjugation or declension with one or more meanings or senses. The 
URL https://machine.wiktionary.org/wiki/2018/12345678/4/ is intended to 
indicate a specific sense or definition of a conjugation or declension. A 
benefit of having URLs for both conjugations or declensions and for specific 
meanings or senses is that HTTP request headers [4] can specify the languages 
and content types of the output desired for a particular URL.
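
As an illustration of such content negotiation, the sketch below prepares, without sending, a request that uses the standard Accept and Accept-Language headers. The function name build_request is hypothetical, and the choice of application/rdf+xml as the requested content type is only an example; which formats the service would offer is not specified here.

```python
from urllib.request import Request


def build_request(url, content_type, language):
    """Prepare (but do not send) a GET request that negotiates format and language."""
    return Request(
        url,
        headers={
            "Accept": content_type,        # desired knowledge representation format
            "Accept-Language": language,   # desired natural language of the output
        },
    )


req = build_request(
    "https://machine.wiktionary.org/wiki/2018/12345678/4/",
    "application/rdf+xml",
    "en-US",
)
```

The same URL could thus yield a human-readable page for a browser and a machine-utilizable definition for a software component.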

The provided examples are intended to indicate that each complete, 
language-independent conjugation or declension can have an ID number as opposed 
to each headword or lemma. Instead of one ID number for all variations of 
“fly”, there is one ID number for “flew”, another for “have flown”, another for 
“flying”, and so on for each conjugation or declension. Reasons for indexing the 
conjugations and declensions instead of traditional headwords or lemmas include 
that, at least for some knowledge representation formats, the formal semantics 
of the definitions vary per conjugation or declension.
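
A minimal sketch of such an index follows, keyed by the full set of grammatical-category values rather than by lemma. Everything in it is illustrative: the names form_key, FORM_IDS and resolve are hypothetical, the feature values chosen for "have flown" and "flying" are assumptions, and every ID other than the example 12345678 is made up.

```python
def form_key(language, lemma, category, **features):
    """Order-independent key for one fully specified conjugation or declension."""
    return (language, lemma, category, frozenset(features.items()))


# Toy index: each fully specified form has its own ID.
FORM_IDS = {
    form_key("en-US", "fly", "verb", tense="past", aspect="past_simple"): 12345678,  # "flew"
    form_key("en-US", "fly", "verb", tense="present", aspect="perfect"): 12345679,   # "have flown"
    form_key("en-US", "fly", "verb", form="present-participle"): 12345680,           # "flying"
}


def resolve(language, lemma, category, **features):
    """Return the ID of the matching form, or None if the form is not indexed."""
    return FORM_IDS.get(form_key(language, lemma, category, **features))


flew_id = resolve("en-US", "fly", "verb", aspect="past_simple", tense="past")
unknown_id = resolve("en-US", "fly", "verb", tense="future")
```

Keying on a frozenset means the order in which categories appear in a query string does not affect which ID is found.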

CONCLUSION

This letter broached machine wiktionaries and some of the services which could 
exist at https://machine.wiktionary.org/ . It is my hope that this letter 
indicated a few of the many exciting topics with regard to machine-utilizable 
crowdsourced lexicons.


REFERENCES

[1] https://en.wiktionary.org/wiki/Index:All_languages#List_of_languages
[2] https://en.wikipedia.org/wiki/Grammatical_category
[3] https://en.wikipedia.org/wiki/Grammatical_conjugation
[4] https://en.wikipedia.org/wiki/List_of_HTTP_header_fields#Request_fields