Ugh, tough ones. I hope someone with a computational linguistics background will
chime in, or check the Lemon models for answers.

I put my answers in-line.

On Fri, Nov 11, 2016 at 9:03 AM Daniel Kinzler <daniel.kinz...@wikimedia.de>
wrote:

> 1) The question that came up for me recently is how we model the
> grammatical
> context for senses. For instance, "to ask" can mean requesting
> information, or
> requesting action, depending on whether we use "ask somebody about" or "ask
> somebody to". Similarly, "to shit" has entirely different meanings when
> used
> reflexively ("I shit myself").
>


Not only that. "I shit myself" is very different from "Don't shit
yourself". It is not just the reflexivity; it might be the whole phrase.

Looking at https://en.wiktionary.org/wiki/ask , we currently do not have
the word "about" on this page. We have a list of different senses, each
with usage examples, and that would work well in the current model. Indeed,
the question is whether "ask somebody about" belongs here or not. "Ask
somebody their age" or "ask somebody the way" would work equally well.

Looking at https://en.wiktionary.org/wiki/shit#Verb the reflexive form is
indeed mentioned on its own page:
https://en.wiktionary.org/wiki/shit_oneself#English - I guess that would
indicate its own Lexeme?
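
If so, a very rough sketch of what that could look like, in Python (the
IDs and field names here are invented for illustration, not the actual
Wikibase schema):

    # Hypothetical, simplified records; IDs and field names are made up.
    lexeme_shit = {
        "id": "L1",
        "lemma": "shit",
        "category": "verb",
        "senses": [{"gloss": "to defecate"}],
    }

    # The reflexive phrase as its own Lexeme, mirroring Wiktionary's
    # separate page for "shit oneself".
    lexeme_shit_oneself = {
        "id": "L2",
        "lemma": "shit oneself",
        "category": "verb",
        "senses": [{"gloss": "to be very frightened"}],
    }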


> We could also add this information as a special field in the Sense entity,
> but I
> don't even know what that field should contain, exactly.
>

Just a usage example on the sense? That would often be enough to express
the distinction.
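
For instance, something like this sketch (field names invented for
illustration, not the actual data model) would let the usage example
carry the grammatical context:

    # Two senses of "ask", told apart by their usage examples.
    ask_senses = [
        {
            "gloss": "to request information",
            "usage_example": "I asked her about the schedule.",
        },
        {
            "gloss": "to request action",
            "usage_example": "I asked her to close the door.",
        },
    ]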



> 2) The older question is how we handle different renderings (spellings,
> scripts)
> of the same lexeme. In English we have "color" vs "colour", in German we
> have
> "stop" vs "stopp" and "Maße" vs "Masse". In Serbian, we have a Roman and
> Cyrillic rendering for every word. We can treat these as separate Lexemes,
> but
> that would mean duplicating all information about them. We could have a
> single
> Lemma, and represent the others as alternative Forms, or using statements
> on the
> Lexeme. But that raises the question which spelling or script should be the
> "main" one, and used in the lemma.
>
> I would prefer to have multi-variant lemmas. They would work like the
> multi-lingual labels we have now on items, but restricted to the variants
> of a
> single language. For display, we would apply a similar language fallback
> mechanism we now apply when showing labels.
>


I am not a fan of multi-variant lemmas. I would prefer to either have
separate Lexemes or alternative Forms. Yes, there will be duplication in
the data, but this is expected already, and since the data is
machine-readable, the duplication can easily be checked and bot-ified.

Also, this is how Wiktionary works today:
https://en.wiktionary.org/wiki/colour
https://en.wiktionary.org/wiki/color

Notice that there is no primacy of either.
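
As a rough sketch of how a bot could keep such duplicates in sync (the
record structure and the "variant of" link are invented for
illustration):

    # Hypothetical records: two separate Lexemes linked by a statement.
    colour = {
        "id": "L10",
        "lemma": "colour",
        "statements": {"variant of": "L11"},  # invented property name
        "senses": [{"gloss": "visual perception of light wavelengths"}],
    }
    color = {
        "id": "L11",
        "lemma": "color",
        "statements": {"variant of": "L10"},
        "senses": [{"gloss": "visual perception of light wavelengths"}],
    }

    def find_divergent_senses(a, b):
        """Return glosses present on one variant but missing on the
        other - the kind of check a bot could run over linked pairs."""
        return {s["gloss"] for s in a["senses"]} ^ {s["gloss"] for s in b["senses"]}

    print(find_divergent_senses(colour, color))  # empty set: in sync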

Having multi-variant lemmas seems to complicate the situation a lot. I
think it is important to have only one single Lemma for each Lexeme, in
order to keep the display logic simple - display logic that will also be
important in tools like the query service and every other place that
shows the data, not only Wikidata. Multi-variant lemmas are a good idea
for entities that you look at in a specific language - like Wikidata's
display of Items - but they are a bad idea for lexical data.

Examples of why this is bad: how would you say that the British English
version is the same as the American English one? You use fallback so you
don't have to duplicate it. But then how do you distinguish an entry that
lacks a BE variant in order to reduce redundancy from an entry that lacks
a BE variant because it simply has not been entered yet? Statements and
Forms, or a separate Lexeme, would both solve that issue. Lemmas do not
have the capability and flexibility of statements.
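
A sketch of the difference (property names invented, purely
illustrative):

    # Explicit via a statement: the entry says the British English
    # spelling is identical to the American English one.
    lexeme_with_statement = {
        "lemma": {"en-US": "stop"},
        "statements": {"identical in variant": ["en-GB"]},  # invented
    }

    # With multi-variant lemmas plus fallback, absence is ambiguous:
    lemma_with_fallback = {"en-US": "stop"}
    # en-GB missing: same as en-US by design, or just not entered yet?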

How do you determine the primacy of the American or the British English
version? Fallback would be written into the code base; it would not be
amenable to community editing through the wiki.

Whether separate Lexemes or alternative Forms are better might differ
from language to language and from case to case. By hard-coding
multi-variant lemmas, you not only pre-decide the case, but also make the
code and the data model much more complicated - and not only for the
initial development, but in perpetuity, whenever the data is used.



> What do you think?
>


We shouldn't strive for perfection and cover everything from the
beginning. I expect that, with the lexical information in the data,
Wikidata will continue to evolve. If we cannot model every case ideally
but can capture 99.9% - well, that's enough to get started, and we can
see later how the exceptions will be handled. Also, there is always
Wiktionary as the layer on top of Wikidata, which can easily resolve
these issues anyway.

Once we have the simple pieces working, we can actually try to understand
where the machinery is creaking and not working well, and then think about
these issues. But until then I would prefer to keep the system as dumb and
simple as possible.

Hope that makes sense,
Denny