Re: [Wikidata-tech] Two questions about Lexeme Modeling

2016-11-21 Thread Daniel Kinzler
Hi all!

Sorry for the delay. To keep the conversation in one place, I will reply to
David, Denny, and Philipp in one mail. It's going to be a bit long, sorry...

On 11.11.2016 at 23:17, David Cuenca Tudela wrote:
> Hi there!
> 
> 1) a possible solution could be to have another category of items ("Gxxx",
> grammatical rule?) to store grammatical structures, like "Noun + verb + object"
> or "Noun + reflexive verb" and then linking to that structure with a qualifier
> of the position that it uses on that structure. Example:
> "to shit"  "Subject + reflexive verb + reflexive pronoun"
>  "reflexive verb"

I see no need for a separate entity type, this could be done with a regular
Item. If we want this to work nicely for display, though, the software would
need to know about some "magic" properties and their meaning. Since Wikidata
provides a stable global vocabulary, it would not be terrible to hard-code this.
But still, it's special case code...

This is pretty similar to Lemon's "Syntactic Frame" that Philipp pointed out,
see below.

> 2) I would prefer statements as they can be complemented with qualifiers as for
> why it has a certain spelling (geographical variant, old usage, corruption...).

You can always use a statement for this kind of information, just as we do now
on Wikidata with properties for the surname or official name.

The question is how often the flexibility of a statement is really needed. If
it's not too often, it would be ok to require both (the lemma and the statement)
to be entered separately, as we do now for official name, birth name, etc.

Another question is which (multi-term lemma or secondary lemma-in-a-statement)
is easier to handle by a 3rd party consumer. More about that later.

> It would be nice however if there would be some mechanism to have a special
> kind of property that would use its value as an item alias. And this is
> something that could benefit normal items in Wikidata too, as most name
> properties like P1448, P1477 (official name, birth name, etc), should have its
> value automatically show as alias of the item in all languages, if that were
> technologically feasible.

Yes, this would be very convenient. But it would also mix levels of content
(editorial vs. sourced) that are now nicely separated. I'm very tempted, but I'm
not sure it's worth it.

On 12.11.2016 at 00:08, Denny Vrandečić wrote:
> Not only that. "I shit myself" is very different from "Don't shit yourself".
> It is not just the reflexivity. It might be the whole phrase.

Yes, the boundary to a phrase is not clear cut. But if we need the full power of
modeling as a phrase, we can always do that by creating a separate Lexeme for
the phrase. The question is if that should be the preferred or even the only way
to model the "syntactic frame".

It's typical for a dictionary to have a list of meanings structured like this:

  to ask
  to ask so. sth.
  to ask so. for sth.
  to ask so. about sth.
  to ask so. after sb.
  to ask so. out
  ...

It would be nice if we had an easy way to create such an overview. If each line
is modeled as a separate Lexeme, we need to decide how these Lexemes should be
connected to allow such an overview.

I feel these "frames" should be attached to senses. Making all of them separate
Lexemes will drive granularity up, making things hard to follow and maintain.

> We could also add this information as a special field in the Sense
> entity, but I don't even know what that field should contain, exactly.

It could be a reference to an Item. Perhaps that item defines a specific
pattern, like "$verb someone" or "$verb someone something" or "$verb oneself".
That pattern (defined by a statement on the item) can then be used to render the
concrete pattern for each word sense.
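
To make the idea concrete, here is a minimal sketch of how such a pattern could
be rendered. Everything in it is an assumption for illustration: the frame IDs,
the FRAME_PATTERNS table standing in for a statement on the Item, the Sense
class and the glosses are all hypothetical, not existing Wikidata items or
Wikibase code.

```python
from dataclasses import dataclass
from string import Template

# Hypothetical frame Items. On Wikidata each pattern would live in a
# statement on the Item; a plain dict stands in for that lookup here.
FRAME_PATTERNS = {
    "Q-frame-1": "$verb someone",
    "Q-frame-2": "$verb someone something",
    "Q-frame-3": "$verb oneself",
}


@dataclass
class Sense:
    gloss: str
    frame: str  # reference to a (hypothetical) frame Item


def render_frame(lemma: str, sense: Sense) -> str:
    """Substitute the lemma into the pattern of the frame the sense refers to."""
    pattern = Template(FRAME_PATTERNS[sense.frame])
    return "to " + pattern.substitute(verb=lemma)


# Illustrative senses for "ask"; the glosses are made up for this sketch.
senses = [
    Sense("address a question to a person", "Q-frame-1"),
    Sense("put a specific question to a person", "Q-frame-2"),
]

# One line per sense, similar to the dictionary-style overview above.
overview = [render_frame("ask", s) for s in senses]
```

With this shape, the overview is derived from the senses of a single Lexeme, so
no separate Lexeme per line is needed.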

> Just a usage example on the sense? That would often be enough to express the
> proposition.

Possible, but then it's unclear which parts of the grammar are required to
generate a specific meaning. You'd need some kind of markup in the example,
which I would like to avoid.

> I am not a friend of multi-variant lemmas. I would prefer to either have
> separate Lexemes or alternative Forms. Yes, there will be duplication in the
> data, but this is expected already, and also, since it is machine-readable,
> the duplication can be easily checked and bot-ified.

Getting rid of bots that keep duplicate data in sync was one of the reasons we
created Wikidata, and one of its major selling points. Bots have a lot of uses,
but copying data around isn't really a good one.

Also, how do you sync deletions? Reverts? The semantics is not trivial.

> Also, this is how Wiktionary works today:
> https://en.wiktionary.org/wiki/colour
> https://en.wiktionary.org/wiki/color
>
> Notice that there is no primacy of either.

True. But that's not how other dictionaries work:

https://dict.leo.org/ende/index_de.html#/search=color
http://www.merriam-webster.com/dictionary/colour

Re: [Wikidata-tech] Fwd: Two questions about Lexeme Modeling

2016-11-21 Thread Philipp Cimiano

Dear Denny, Daniel,

Thanks for your questions. I will try to answer.

ad 1) "ask somebody about" and "ask somebody to" are two different 
syntactic and semantic frames.


Please look at the final spec of the lemon model:

https://www.w3.org/community/ontolex/wiki/Final_Model_Specification#Syntactic_Frames

In particular, check example: synsem/example7

There you see two different syntactic frames for the word "give". In 
this case they both represent the same sense corresponding to an 
exchange of goods but with different syntactic constructions.


In your case for "ask" there would also be two syntactic frames, but two 
senses instead of one.
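
As a rough illustration of that difference, here is a sketch in plain Python
rather than in lemon/RDF. The class names (Frame, Sense, LexicalEntry) are
hypothetical stand-ins, and the frame patterns for "give" are placeholders
chosen for this sketch, not copied from the spec example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Frame:
    pattern: str  # informal description of the syntactic construction


@dataclass
class Sense:
    gloss: str
    frames: List[Frame] = field(default_factory=list)


@dataclass
class LexicalEntry:
    lemma: str
    senses: List[Sense] = field(default_factory=list)


# "give": two frames that realise one and the same sense.
give = LexicalEntry("give", [
    Sense("exchange of goods", [
        Frame("give somebody something"),
        Frame("give something to somebody"),
    ]),
])

# "ask": two frames as well, but each one belongs to its own sense.
ask = LexicalEntry("ask", [
    Sense("request information", [Frame("ask somebody about")]),
    Sense("request action", [Frame("ask somebody to")]),
])


def frame_count(entry: LexicalEntry) -> int:
    """Total number of frames attached to the senses of an entry."""
    return sum(len(sense.frames) for sense in entry.senses)
```

Both entries carry two frames; they differ only in how many senses those
frames are distributed over.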


If you want I can send you a modelled example.

2) Such spelling variants are modelled in lemon as two different 
representations of the same lexical entry.


See ontolex/example3 in the above mentioned spec. After all, it is the 
same word with the same meanings and same pronunciation but just with a 
different spelling for each dialect of English.


In our understanding these are not two different forms as you mention, 
but two different spellings of the same form.


A form represents a particular grammatical variant, not a spelling 
variant. In this case it is the singular form of the noun. But both 
spellings really represent the same (grammatical) form, that is the 
singular form of the noun.


You do not need to specify one main written representation for each 
form, as both are valid depending on the context.


The preference for showing e.g. the American or English variant should 
be stated by the application that uses the lexicon.
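
A small sketch of what that could look like on the application side, assuming
a form is stored as a mapping from a language-variant code to its written
representation. The function name pick_representation and the data layout are
assumptions for illustration, not part of lemon or of Wikibase.

```python
from typing import Mapping, Sequence


def pick_representation(
    representations: Mapping[str, str],
    preferred: Sequence[str],
) -> str:
    """Return the first written representation matching the preference list.

    If none of the preferred variants is available, fall back to the
    representation with the alphabetically first variant code, so the
    result stays deterministic. Raises ValueError for an empty mapping.
    """
    for code in preferred:
        if code in representations:
            return representations[code]
    return representations[min(representations)]


# One grammatical form (the singular of the noun) with two spellings.
singular = {"en-US": "color", "en-GB": "colour"}

# Each application states its own preference; the lexicon names no "main" one.
american = pick_representation(singular, ["en-US", "en-GB"])
british = pick_representation(singular, ["en-GB", "en-US"])
```

The lexicon stays neutral here; only the preference list passed in by the
caller decides which spelling is shown.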


Does this help?

Philipp

On 11.11.16 at 20:07, Denny Vrandečić wrote:
The Wikidata Lexeme model is basically based on Lemon, so I wanted to 
ask you whether you have answers for the following questions in Lemon?


Feel free to answer directly to the list:

https://lists.wikimedia.org/pipermail/wikidata-tech/2016-November/001057.html 



Cheers,
Denny



-- Forwarded message -
From: Daniel Kinzler
Date: Fri, Nov 11, 2016 at 9:03 AM
Subject: [Wikidata-tech] Two questions about Lexeme Modeling
To: wikidata-tech



Hi all!

There are two questions about modelling lexemes that are bothering me. One is an
old question, and one I only came across recently.

1) The question that came up for me recently is how we model the grammatical
context for senses. For instance, "to ask" can mean requesting information, or
requesting action, depending on whether we use "ask somebody about" or "ask
somebody to". Similarly, "to shit" has entirely different meanings when used
reflexively ("I shit myself").

There is no good place for this in our current model. The information could be
placed in a statement on the word Sense, but that would be kind of non-obvious,
and would not (at least not easily) allow for a concise rendering, in the way we
see it in most dictionaries ("to ask sbdy to do sthg"). The alternative would be
to treat each usage with a different grammatical context as a separate Lexeme (a
verb phrase Lexeme), so "to shit oneself" would be a separate lemma. That could
lead to a fragmentation of the content in a way that is quite unexpected to
people used to traditional dictionaries.

We could also add this information as a special field in the Sense entity, but I
don't even know what that field should contain, exactly.

Got a better idea?


2) The older question is how we handle different renderings (spellings, scripts)
of the same lexeme. In English we have "color" vs "colour", in German we have
"stop" vs "stopp" and "Maße" vs "Masse". In Serbian, we have a Roman and
Cyrillic rendering for every word. We can treat these as separate Lexemes, but
that would mean duplicating all information about them. We could have a single
Lemma, and represent the others as alternative Forms, or using statements on the
Lexeme. But that raises the question which spelling or script should be the
"main" one, and used in the lemma.

I would prefer to have multi-variant lemmas. They would work like the
multi-lingual labels we have now on items, but restricted to the variants of a
single language. For display, we would apply a language fallback mechanism
similar to the one we now apply when showing labels.

2b) if we treat lemmas as multi-variant, should Forms also be multi-variant, or
should they be per-variant? Should the gloss of a Sense be multi-variant? I
currently tend towards "yes" for all of the above.


What do you think?


--
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.



--
--
Prof. Dr. Philipp