Re: [Wikidata-tech] Two questions about Lexeme Modeling

2016-11-25 Thread Daniel Kinzler

Am 25.11.2016 um 12:16 schrieb David Cuenca Tudela:
>> If we want to avoid this complexity, we could just go by prefix. So if the
>> language is "de", variants like "de-CH" or "de-DE_old" would be considered ok.
>> Ordering these alphabetically would put the "main" code (with no suffix) first.
>> May be ok for a start.
> 
> I find this issue potentially controversial, and I think that the community at
> large should be involved in this matter to avoid future dissatisfaction and to
> promote involvement in the decision-making.

We should absolutely discuss this with Wiktionarians. My suggestion was intended
as a baseline implementation. Details about the restrictions on which variants
are allowed on a Lexeme, or in what order they are shown, can be changed later
without breaking anything.

> In my opinion it would be more appropriate to use standardized language codes,
> and then specify the dialect with an item, as it provides greater flexibility.
> However, as mentioned before I would prefer if this topic in particular would
> be discussed with wiktionarians.

Using Items to represent dialects is going to be tricky. We need ISO language
codes for use in HTML and RDF. We can somehow map between Items and ISO codes,
but that's going to be messy, especially when that mapping changes.

So it seems like we need to further discuss how to represent a Lexeme's language
and each lemma's variant. My current thinking is to represent the language as an
Item reference, and the variant as an ISO code. But you are suggesting the
opposite.
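
To make the "language as Item reference, variant as ISO code" idea concrete, here is a minimal sketch. The class and field names are hypothetical and are not part of Wikibase; the Q-Id is only an assumed example value. The point it illustrates is that lemma variants keyed by language code can be emitted for HTML or RDF directly, without an Item-to-code mapping.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Lexeme:
    """Hypothetical sketch; not the actual Wikibase Lexeme data model."""

    language_item: str  # Item reference (Q-Id) identifying the language
    lemmas: Dict[str, str] = field(default_factory=dict)  # variant code -> spelling

    def html_lang(self, variant: str) -> str:
        """Return the value for an HTML lang attribute for one lemma variant."""
        if variant not in self.lemmas:
            raise KeyError(f"no lemma for variant {variant!r}")
        # The variant is already a language code, so no Item-to-code mapping is needed.
        return variant


# Illustrative values only; the Q-Id is an assumed example.
colour = Lexeme(language_item="Q1860", lemmas={"en-GB": "colour", "en-US": "color"})
```

With the opposite design (dialect as an Item), `html_lang` would need a lookup from Item to code, which is the mapping described above as messy.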

I can see why one would want items for dialects, but I currently have no good
idea for making this work with the existing technology. Further investigation is
needed.

I have filed a Phabricator task for investigating this. I suggest taking the
discussion about how to represent languages/variants/dialects/etc. there:

https://phabricator.wikimedia.org/T151626

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech

Re: [Wikidata-tech] Two questions about Lexeme Modeling

2016-11-25 Thread Thiemo Mättig

Hi all!

I tweaked my part of the decision matrix a little bit:

https://docs.google.com/spreadsheets/d/1PtGkt6E8EadCoNvZLClwUNhCxC-cjTy5TY8seFVGZMY/edit?ts=5834219d#gid=868938568

The arguments in my matrix are basically a collection of "the worst
things that can happen". I like this approach. ;-)

The arguments I consider most important (they should have a high
number in the last column) are:

1. Changing Term to TermList later is almost impossible. This alone
could be set to a "-100" and make all the other arguments obsolete.

2. I'm very much concerned about any UI consuming Lemmas becoming very
complicated, from both the users' and the devs' perspective. When a Lexeme
allows any number of Lemmas, should this include zero Lemmas? Which
language codes will be allowed? Do we want to enforce at least one
Lemma? Do we need to validate the used language codes, or are
post-edit checks enough? Do we even have standardized language codes
for all variants? Is it possible to have multiple Lemmas with the same
language code? Which Lemma is the primary one then? How to deprecate
one?

The list goes on.

All this sounds like we are going to reimplement the majority of the
statements UI, just without Ranks, Qualifiers and References.

Third-party devs will also have to deal with all these problems (also
see Denny's comments).

I suggest using a TermList anyway, but to start with a very hard
limitation: It *must* contain exactly one element, and the language
code *must* be the exact same as the language code of the Lexeme. We
can lift all these limitations later when needed, step by step.
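
The initial restriction could be checked along these lines. This is only a sketch under assumptions: a TermList is modeled here as a plain dict from language code to text, and `validate_lemmas` is a hypothetical name, not an existing Wikibase function.

```python
from typing import Dict


def validate_lemmas(lexeme_language: str, lemmas: Dict[str, str]) -> None:
    """Raise ValueError unless there is exactly one lemma in the Lexeme's language."""
    if len(lemmas) != 1:
        raise ValueError(f"expected exactly one lemma, got {len(lemmas)}")
    (code,) = lemmas  # unpack the single language code
    if code != lexeme_language:
        raise ValueError(
            f"lemma language {code!r} does not match Lexeme language {lexeme_language!r}"
        )


validate_lemmas("de", {"de": "Stopp"})  # passes silently
```

Lifting a limitation later would mean relaxing one of the two checks, while the stored structure stays a TermList throughout.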

Best
Thiemo


Re: [Wikidata-tech] Two questions about Lexeme Modeling

2016-11-25 Thread David Cuenca Tudela

> If we want to avoid this complexity, we could just go by prefix. So if the
> language is "de", variants like "de-CH" or "de-DE_old" would be considered ok.
> Ordering these alphabetically would put the "main" code (with no suffix) first.
> May be ok for a start.

I find this issue potentially controversial, and I think that the community
at large should be involved in this matter to avoid future dissatisfaction
and to promote involvement in the decision-making.

For languages there are regulatory bodies that assign codes, but for
varieties that is not the case, or at least not entirely. Even under en-GB
there are many varieties and dialects:
https://en.wikipedia.org/wiki/List_of_dialects_of_the_English_language#United_Kingdom

In my opinion it would be more appropriate to use standardized language
codes, and then specify the dialect with an item, as it provides greater
flexibility. However, as mentioned before I would prefer if this topic in
particular would be discussed with wiktionarians.


Thanks for moving this forward!

David




Re: [Wikidata-tech] Two questions about Lexeme Modeling

2016-11-25 Thread Daniel Kinzler

Thank you Denny for having an open mind! And sorry for being a nuisance ;)

I think it's very important to have controversial but constructive discussions
about these things. Data models are very hard to change even slightly once
people have started to create and use the data. We need to try hard to get it as
right as possible off the bat.

Some remarks inline below.

Am 25.11.2016 um 03:32 schrieb Denny Vrandečić:
> There is one thing that worries me about the multi-lemma approach, namely the
> mentions of a discussion about ordering. If possible, I would suggest not to
> have ordering in every single Lexeme or even Form, but rather to use the
> following solution:
> 
> If I understand it correctly, we won't let every Lexeme have every arbitrary
> language anyway, right? Instead we will, for each language that has variants
> have somewhere in the configurations an explicit list of these variants, i.e.
> say, for English it will be US, British, etc., for Portuguese Brazilian and
> Portuguese, etc.

That approach is similar to what we are now doing for sorting Statement groups
on Items. There is a global ordering of properties defined on a wiki page. So
the community can still fight over it, but only in one place :) We can re-order
based on user preference using a Gadget.

For the multi-variant lemmas, we need to declare the Lexeme's language
separately, in addition to the language code associated with each lemma variant.
It seems like the language will probably be represented as a reference to a
Wikidata Item (that is, a Q-Id). That Item can be associated with an (ordered)
list of matching language codes, via Statements on the Item, or via
configuration (or, like we do for unit conversion, configuration generated from
Statements on Items).

If we want to avoid this complexity, we could just go by prefix. So if the
language is "de", variants like "de-CH" or "de-DE_old" would be considered ok.
Ordering these alphabetically would put the "main" code (with no suffix) first.
May be ok for a start.
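
The prefix rule and the alphabetical ordering can be sketched as follows. The function names are hypothetical; the rule assumes that a variant is either the bare language code or that code followed by a hyphen and a suffix.

```python
from typing import Iterable, List


def is_allowed_variant(lexeme_language: str, variant: str) -> bool:
    """Accept the bare language code, or that code followed by "-" and a suffix."""
    return variant == lexeme_language or variant.startswith(lexeme_language + "-")


def order_variants(variants: Iterable[str]) -> List[str]:
    """Plain alphabetical order; a code with no suffix sorts before its variants."""
    return sorted(variants)


print(order_variants(["de-DE_old", "de", "de-CH"]))
# prints ['de', 'de-CH', 'de-DE_old']
```

Requiring the hyphen after the prefix keeps an unrelated code that merely starts with the same letters from being accepted.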

I'm not sure yet on what level we want to enforce the restriction on language
codes. We can do it just before saving new data (the "validation" step), or we
could treat it as a community enforced soft constraint. I'm tending towards the
former, though.

> Given that, we can in that very same place also define their ordering and
> their fallbacks.

Well, all lemmas would fall back on each other, the question is just which ones
should be preferred. Simple heuristic: prefer the shortest language code. Or go
by what MediaWiki does for the UI (which is what we do for Item labels).
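
A sketch of the "shortest language code" heuristic, with an optional preferred code standing in for the user's language. The function name and the tie-breaking rule (alphabetical among equally short codes) are assumptions for illustration; this is not MediaWiki's actual fallback logic.

```python
from typing import Dict, Optional


def pick_lemma(lemmas: Dict[str, str], preferred: Optional[str] = None) -> str:
    """Return the lemma for the preferred code, else the one with the shortest code."""
    if not lemmas:
        raise ValueError("no lemmas to choose from")
    if preferred is not None and preferred in lemmas:
        return lemmas[preferred]
    # Shortest code wins; ties are broken alphabetically to stay deterministic.
    code = min(lemmas, key=lambda c: (len(c), c))
    return lemmas[code]
```

For example, with lemmas `{"en-GB": "colour", "en": "color"}`, no preference yields the "en" spelling, while a preference for "en-GB" yields "colour".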

> The upside is that it seems that this very same solution could also be used
> for languages with different scripts, like Serbian, Kazakh, and Uzbek (although
> it would not cover the problems with Chinese, but that wasn't solved previously
> either - so the situation is strictly better). (It doesn't really solve all
> problems - there is a reason why ISO treats language variants and scripts
> independently - but it improves on the vast majority of the problematic cases).

Yes, it's not the only decision we have to make in this regard, but the most
fundamental one, I think.

One consequence of this is that Forms should probably also allow multiple
representations/spellings. This is for consistency with the lemma, for code
re-use, and for compatibility with Lemon.

> So, given that we drop any local ordering in the UI and API, I think that
> staying close to Lemon and choosing a TermList seems currently like the most
> promising approach to me, and I changed my mind. 

Knowing that you won't do that without a good reason, I thank you for the
compliment :)

> My previous reservations still
> hold, and it will lead to some more complexity in the implementation not only
> of Wikidata but also of tools built on top of it,

The complexity of handling a multi-variant lemma is higher than a single string,
but any wikibase client already needs to have the relevant code anyway, to
handle item labels. So I expect little overhead. We'll want the lemma to be
represented in a more compact way in the UI than we currently use for labels,
though.


Thank you all for your help!


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


Re: [Wikidata-tech] Two questions about Lexeme Modeling

2016-11-24 Thread Denny Vrandečić

Hi all,

thanks for the two matrices and the input here. I am tending to again let
Daniel convince me about using multiple representations for the lemma and
the forms. Mostly because that's what's closest to Lemon, and I trust the
research and expertise within Lemon. Thank you Philipp for chiming in!

There is one thing that worries me about the multi-lemma approach, namely the
mentions of a discussion about ordering. If possible, I would suggest
not to have ordering in every single Lexeme or even Form, but rather to use
the following solution:

If I understand it correctly, we won't let every Lexeme have every
arbitrary language anyway, right? Instead, for each language that has
variants, we will have an explicit list of these variants somewhere in the
configuration; say, for English it will be US, British, etc., for
Portuguese, Brazilian and Portuguese, etc.

Given that, we can in that very same place also define their ordering and
their fallbacks. There is no need to have that being fought out on every
single Lexeme. This will also reduce the complexity of the TermList
solution, and thus bring Thiemo's decision matrix and Daniel's into
alignment regarding their recommendation.
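
As an illustration of a single, central place that defines both the allowed variants and their order, consider this sketch. `VARIANT_CONFIG`, its contents, and `ordered_lemmas` are hypothetical; the real configuration format was not decided in this thread.

```python
from typing import Dict, List, Tuple

# Hypothetical central configuration: allowed variants per language, in display order.
VARIANT_CONFIG: Dict[str, List[str]] = {
    "en": ["en-US", "en-GB"],
    "pt": ["pt-BR", "pt-PT"],
}


def ordered_lemmas(language: str, lemmas: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (code, lemma) pairs in the globally configured order."""
    order = VARIANT_CONFIG[language]
    unknown = set(lemmas) - set(order)
    if unknown:
        raise ValueError(f"variants not configured for {language!r}: {sorted(unknown)}")
    return [(code, lemmas[code]) for code in order if code in lemmas]
```

Because the order lives in the configuration, no individual Lexeme stores or debates its own ordering.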

The upside is that it seems that this very same solution could also be used
for languages with different scripts, like Serbian, Kazakh, and Uzbek
(although it would not cover the problems with Chinese, but that wasn't
solved previously either - so the situation is strictly better). (It
doesn't really solve all problems - there is a reason why ISO treats
language variants and scripts independently - but it improves on the vast
majority of the problematic cases).

So, given that we drop any local ordering in the UI and API, I think that
staying close to Lemon and choosing a TermList seems currently like the
most promising approach to me, and I changed my mind. My previous
reservations still hold, and it will lead to some more complexity in the
implementation not only of Wikidata but also of tools built on top of it,
but it seems that the advantages for the Wikidata contributors and a better
scientifically supported data model outweigh this.

I hope that makes sense,
Giving thanks,
Denny


Re: [Wikidata-tech] Two questions about Lexeme Modeling

2016-11-22 Thread Daniel Kinzler

Am 12.11.2016 um 00:08 schrieb Denny Vrandečić:
> I am not a friend of multi-variant lemmas. I would prefer to either have
> separate Lexemes or alternative Forms. 

We have created a decision matrix to help with discussing the pros and cons of
the different approaches. Please have a look and comment:

https://docs.google.com/spreadsheets/d/1PtGkt6E8EadCoNvZLClwUNhCxC-cjTy5TY8seFVGZMY/edit?ts=5834219d#gid=0

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


Re: [Wikidata-tech] Two questions about Lexeme Modeling

2016-11-22 Thread Daniel Kinzler

Am 22.11.2016 um 10:19 schrieb David Cuenca Tudela:
>> There are many many words with multiple spellings, but not many words with
>> more than two, and few with more than three [citation needed].
> 
> That is not true in languages with a large number of dialects. For instance in
> Catalan there are 5 standard spellings for "carrot" depending on which dialect
> you choose, plus some more if you consider local variations:
> https://ca.wikipedia.org/wiki/Pastanaga

How does Lemon handle this? Does it provide some guidance on how to display a
Form with many representations? Or is that simply left to the application?

You are right that dialects pose a problem here, since they often have multiple
competing spellings (e.g. there's German Low-German and Dutch Low-German -
mostly same vocabulary, different orthography).

> Additionally the same form can have different meanings depending on which
> dialect you choose. For instance "pastenaga" means "orange carrot" in Catalan
> from Catalonia, and "purple carrot" in Catalan from Valencia.
> 
> Which makes me think, how dialects will be handled? Statements?

This is up to the community.  I suppose it will depend on the individual case.
Sometimes, it will be more useful to have a separate lexeme. Sometimes, you'd
have multiple representations (lemmas), plus representation statements with
qualifiers.


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.


Re: [Wikidata-tech] Two questions about Lexeme Modeling

2016-11-22 Thread David Cuenca Tudela

> There are many many words with multiple spellings, but not many words
> with more than two, and few with more than three [citation needed].

That is not true in languages with a large number of dialects. For instance
in Catalan there are 5 standard spellings for "carrot" depending on which
dialect you choose, plus some more if you consider local variations:
https://ca.wikipedia.org/wiki/Pastanaga

But that is nothing compared to the 8 spellings of "tomato", or more if you
count the local variations:
https://ca.wikipedia.org/wiki/Tom%C3%A0quet

Additionally the same form can have different meanings depending on which
dialect you choose. For instance "pastenaga" means "orange carrot" in
Catalan from Catalonia, and "purple carrot" in Catalan from Valencia.

Which makes me wonder: how will dialects be handled? As Statements?

This is an example of a dialect map:
https://ca.wikipedia.org/wiki/Dialectes_del_catal%C3%A0#Divisi.C3.B3_dialectal

Regards and thanks for elaborating your long answer,
-d




Re: [Wikidata-tech] Two questions about Lexeme Modeling

2016-11-21 Thread Daniel Kinzler

Hi all!

Sorry for the delay. To keep the conversation in one place, I will reply to
David, Denny, and Philipp in one mail. It's going to be a bit long, sorry...

Am 11.11.2016 um 23:17 schrieb David Cuenca Tudela:
> Hi there!
> 
> 1) a possible solution could be to have another category of items ("Gxxx",
> grammatical rule?) to store grammatical structures, like "Noun + verb + object"
> or "Noun + reflexive verb" and then linking to that structure with a qualifier
> of the position that it uses on that structure. Example:
> "to shit"  "Subject + reflexive verb + reflexive pronoun"
>  "reflexive verb"

I see no need for a separate entity type, this could be done with a regular
Item. If we want this to work nicely for display, though, the software would
need to know about some "magic" properties and their meaning. Since Wikidata
provides a stable global vocabulary, it would not be terrible to hard-code this.
But still, it's special case code...

This is pretty similar to Lemon's "Syntactic Frame" that Philipp pointed out,
see below.

> 2) I would prefer statements as they can be complemented with qualifiers as
> for why it has a certain spelling (geographical variant, old usage,
> corruption...).

You can always use a statement for this kind of information, just as we do now
on Wikidata with properties for the surname or official name.

The question is how often the flexibility of a statement is really needed. If
it's not too often, it would be ok to require both (the lemma and the statement)
to be entered separately, as we do now for official name, birth name, etc.

Another question is which (multi-term lemma or secondary lemma-in-a-statement)
is easier to handle by a 3rd party consumer. More about that later.

> It would be nice however if there would be some mechanism to have a special
> kind of property that would use its value as an item alias. And this is something
> that could benefit normal items in Wikidata too, as most name properties like
> P1448, P1477 (official name, birth name, etc), should have its value
> automatically show as alias of the item in all languages, if that were
> technologically feasible.

Yes, this would be very convenient. But it would also mix levels of content
(editorial vs. sourced) that are now nicely separated. I'm very tempted, but I'm
not sure it's worth it.

Am 12.11.2016 um 00:08 schrieb Denny Vrandečić:
> Not only that. "I shit myself" is very different from "Don't shit yourself".
> It is not just the reflexivity. It might be the whole phrase.

Yes, the boundary to a phrase is not clear cut. But if we need the full power of
modeling as a phrase, we can always do that by creating a separate Lexeme for
the phrase. The question is if that should be the preferred or even the only way
to model the "syntactic frame".

It's typical for a dictionary to have a list of meanings structured like this:

  to ask
  to ask so. sth.
  to ask so. for sth.
  to ask so. about sth.
  to ask so. after sb.
  to ask so. out
  ...

It would be nice if we had an easy way to create such an overview. If each line
is modeled as a separate Lexeme, we need to decide how these Lexemes should be
connected to allow such an overview.

I feel these "frames" should be attached to senses. Making all of them separate
Lexemes will drive granularity up, making things hard to follow and maintain.

> We could also add this information as a special field in the Sense
> entity, but I don't even know what that field should contain, exactly.

It could be a reference to an Item. Perhaps that item defines a specific
pattern, like "$verb someone" or "$verb someone something" or "$verb oneself".
That pattern (defined by a statement on the item) can then be used to render the
concrete pattern for each word sense.
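
Rendering such a pattern could look like the sketch below. It assumes the pattern string has already been read from a statement on the referenced Item; the function name is hypothetical, and Python's `string.Template` is used only because its `$name` placeholder syntax happens to match the notation above.

```python
from string import Template
from typing import List


def render_frames(verb: str, patterns: List[str]) -> List[str]:
    """Substitute the verb into each pattern, e.g. "$verb someone"."""
    return [Template(pattern).substitute(verb=verb) for pattern in patterns]


# Patterns as they might be defined on Items, one per syntactic frame.
patterns = ["$verb someone", "$verb someone something", "$verb oneself"]

for line in render_frames("to ask", patterns):
    print(line)
# prints:
# to ask someone
# to ask someone something
# to ask oneself
```

A list built this way, one rendered line per sense, would give the dictionary-style overview shown earlier without creating a separate Lexeme per line.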

> Just a usage example on the sense? That would often be enough to express the
> proposition.

Possible, but then it's unclear which parts of the grammar are required to
generate a specific meaning. You'd need some kind of markup in the example,
which I would like to avoid.

> I am not a friend of multi-variant lemmas. I would prefer to either have
> separate Lexemes or alternative Forms. Yes, there will be duplication in the
> data, but this is expected already, and also, since it is machine-readable,
> the duplication can be easily checked and bot-ified.

Getting rid of bots that keep duplicate data in sync was one of the reasons we
created Wikidata, and one of its major selling points. Bots have a lot of uses,
but copying data around isn't really a good one.

Also, how do you sync deletions? Reverts? The semantics is not trivial.

> Also, this is how Wiktionary works today:
> https://en.wiktionary.org/wiki/colour
> https://en.wiktionary.org/wiki/color
>
> Notice that there is no primacy of either.

True. But that's not how other dictionaries work:

https://dict.leo.org/ende/index_de.html#/search=color
http://www.merriam-webster.com/dictionary/colour
http://www.dictionary.com/bro

Re: [Wikidata-tech] Two questions about Lexeme Modeling

2016-11-11 Thread Denny Vrandečić

Ugh, tough ones. I hope someone with a computational linguistics background
will chime in, or check the Lemon models for answers.

I put my answers in-line.

On Fri, Nov 11, 2016 at 9:03 AM Daniel Kinzler wrote:

> 1) The question that came up for me recently is how we model the grammatical
> context for senses. For instance, "to ask" can mean requesting information, or
> requesting action, depending on whether we use "ask somebody about" or "ask
> somebody to". Similarly, "to shit" has entirely different meanings when used
> reflexively ("I shit myself").
>

Not only that. "I shit myself" is very different from "Don't shit
yourself". It is not just the reflexivity. It might be the whole phrase.

Looking at https://en.wiktionary.org/wiki/ask , we currently do not have
the word "about" on this page. We have a list of different senses, each
with usage examples, and that would work well in the current model. Indeed,
the question is whether "ask somebody about" belongs here or not. "ask
somebody their age" or "ask somebody for the way" works equally well.

Looking at https://en.wiktionary.org/wiki/shit#Verb the reflexive form is
indeed mentioned on its own page:
https://en.wiktionary.org/wiki/shit_oneself#English - I guess that would
indicate its own Lexeme?

> We could also add this information as a special field in the Sense entity, but
> I don't even know what that field should contain, exactly.

Just a usage example on the sense? That would often be enough to express
the proposition.

> 2) The older question is how we handle different renderings (spellings, scripts)
> of the same lexeme. In English we have "color" vs "colour", in German we have
> "stop" vs "stopp" and "Maße" vs "Masse". In Serbian, we have a Roman and
> Cyrillic rendering for every word. We can treat these as separate Lexemes, but
> that would mean duplicating all information about them. We could have a single
> Lemma, and represent the others as alternative Forms, or using statements on the
> Lexeme. But that raises the question which spelling or script should be the
> "main" one, and used in the lemma.
>
> I would prefer to have multi-variant lemmas. They would work like the
> multi-lingual labels we have now on items, but restricted to the variants of a
> single language. For display, we would apply a similar language fallback
> mechanism we now apply when showing labels.
>

I am not a friend of multi-variant lemmas. I would prefer to either have
separate Lexemes or alternative Forms. Yes, there will be duplication in
the data, but this is expected already, and also, since it is
machine-readable, the duplication can be easily checked and bot-ified.
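To illustrate what such a bot check could look like, here is a minimal sketch. It assumes two separate Lexemes are held as plain dictionaries; the field names ("lemma", "category", "senses") and the function diverging_fields are made up for this example and are not part of any existing Wikibase API.

```python
from typing import Any, Dict, List


def diverging_fields(a: Dict[str, Any], b: Dict[str, Any], shared: List[str]) -> List[str]:
    """Return the names of the fields that should match but differ between a and b."""
    return [name for name in shared if a.get(name) != b.get(name)]


# Hypothetical records for two spelling variants kept as separate Lexemes.
color = {"lemma": "color", "category": "noun", "senses": ["hue"]}
colour = {"lemma": "colour", "category": "noun", "senses": ["hue", "pigment"]}

# The lemma is expected to differ, so only the other fields are compared.
print(diverging_fields(color, colour, ["category", "senses"]))
```

A bot could run a check like this over pairs of Lexemes that are linked as spelling variants and report the ones that have drifted apart.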

Also, this is how Wiktionary works today:
https://en.wiktionary.org/wiki/colour
https://en.wiktionary.org/wiki/color

Notice that there is no primacy of either.

Having multi-variant lemmas seems to complicate the situation a lot. I think
it is important to have only one single Lemma for each Lexeme, in order to
keep display logic simple - the display logic which will also be important
in tools like the query service and every place that displays the data, not
only Wikidata. Multi-variant lemmas are a good idea if you have entities
that you look at in a specific language - like Wikidata's display of Items
- but it is a bad idea for lexical data.

Examples of why this is bad: how would you say that the British English
version is the same as the American English? You use fallback so you don't
have to duplicate it. But what is the difference between an entry that doesn't
have a BE variant in order to reduce redundancy and an entry that doesn't
have a BE variant because it has not been entered yet? Statements and
Forms, or a separate Lemma, would both solve that issue. Lemmas do not
have the capability and flexibility of statements.
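A small sketch of that ambiguity, assuming lemmas are stored as a mapping from variant code to spelling and that "en-GB" falls back to "en". The function british_spelling and the sample entries are invented for illustration only.

```python
from typing import Dict, Optional


def british_spelling(lemmas: Dict[str, str]) -> Optional[str]:
    """Resolve the en-GB lemma, falling back to plain "en" when it is absent."""
    return lemmas.get("en-GB", lemmas.get("en"))


# "en-GB" left out on purpose, because the spelling is the same everywhere.
same_in_both = {"en": "water"}
# "en-GB" left out because nobody has entered "colour" yet.
not_yet_entered = {"en": "color"}

for lemmas in (same_in_both, not_yet_entered):
    # Both entries have the same shape: a result via fallback, no en-GB key.
    print(british_spelling(lemmas), "en-GB" in lemmas)
```

Nothing in the data distinguishes the deliberate omission from the missing entry, whereas a statement could record explicitly that the two variants are identical.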

How do you determine the primacy of the American or British English
version? Fallback would be written into the code base, it would not be
amenable to community editing through the wiki.

Whether separate Lexemes or alternative Forms are better might differ
from language to language, from case to case. By hard-coding the
multi-variant lemmas, you not only pre-decide the case, but also make the
code and the data model much more complicated. And not only for the initial
development, but in perpetuity, whenever the data is used.

> What do you think?
>

We shouldn't push for perfection and for covering everything from the
beginning. I expect that with the lexical information in the data, Wikidata
will continue to evolve. If not every case can be ideally modeled, but we
can capture 99.9% - well, that's enough to get started, and then see later
how the exceptions will be handled. Also, there is always Wiktionary as the
layer on top of Wikidata that actually can easily resolve these issues
anyway.

Once we have the simple pieces working, we can actually try to understand
where the machinery is creaking and not working well, and then think about
these is

[Wikidata-tech] Two questions about Lexeme Modeling

2016-11-11 Thread Daniel Kinzler

Hi all!

There are two questions about modelling lexemes that are bothering me. One is an
old question, and one I only came across recently.

1) The question that came up for me recently is how we model the grammatical
context for senses. For instance, "to ask" can mean requesting information, or
requesting action, depending on whether we use "ask somebody about" or "ask
somebody to". Similarly, "to shit" has entirely different meanings when used
reflexively ("I shit myself").

There is no good place for this in our current model. The information could be
placed in a statement on the word Sense, but that would be kind of non-obvious,
and would not (at least not easily) allow for a concise rendering, in the way we
see it in most dictionaries ("to ask sbdy to do sthg"). The alternative would be
to treat each usage with a different grammatical context as a separate Lexeme (a
verb phrase Lexeme), so "to shit oneself" would be a separate lemma. That could
lead to a fragmentation of the content in a way that is quite unexpected to
people used to traditional dictionaries.
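To make the two alternatives concrete, here is a rough sketch. The classes Sense and Lexeme and the "frame" statement key are simplified stand-ins made up for this mail, not the actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sense:
    gloss: str
    # Hypothetical free-form statements attached to the sense.
    statements: Dict[str, str] = field(default_factory=dict)


@dataclass
class Lexeme:
    lemma: str
    senses: List[Sense] = field(default_factory=list)


# Alternative A: one Lexeme, grammatical context as a statement on each Sense.
ask = Lexeme("ask", [
    Sense("request information", {"frame": "ask somebody about"}),
    Sense("request an action", {"frame": "ask somebody to"}),
])

# Alternative B: one verb phrase Lexeme per grammatical context.
ask_about = Lexeme("ask somebody about", [Sense("request information")])
ask_to = Lexeme("ask somebody to", [Sense("request an action")])

print([sense.statements["frame"] for sense in ask.senses])
print([lexeme.lemma for lexeme in (ask_about, ask_to)])
```

In A the context is buried in a statement and needs special handling to be rendered concisely. In B it is visible in the lemma, but the content about "ask" is spread over several Lexemes.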

We could also add this information as a special field in the Sense entity, but I
don't even know what that field should contain, exactly.

Got a better idea?


2) The older question is how we handle different renderings (spellings, scripts)
of the same lexeme. In English we have "color" vs "colour", in German we have
"stop" vs "stopp" and "Maße" vs "Masse". In Serbian, we have a Roman and
Cyrillic rendering for every word. We can treat these as separate Lexemes, but
that would mean duplicating all information about them. We could have a single
Lemma, and represent the others as alternative Forms, or using statements on the
Lexeme. But that raises the question which spelling or script should be the
"main" one, and used in the lemma.

I would prefer to have multi-variant lemmas. They would work like the
multi-lingual labels we have now on items, but restricted to the variants of a
single language. For display, we would apply a similar language fallback
mechanism we now apply when showing labels.
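Roughly what I have in mind, as a minimal sketch. The fallback chains, the variant codes and the function pick_lemma are assumptions for illustration; the real chains would come from the existing language fallback configuration.

```python
from typing import Dict, List, Optional

# Hypothetical fallback chains, restricted to variants of a single language.
FALLBACK_CHAINS: Dict[str, List[str]] = {
    "en-GB": ["en-GB", "en"],
    "en-US": ["en-US", "en"],
    "en": ["en"],
}


def pick_lemma(lemmas: Dict[str, str], requested: str) -> Optional[str]:
    """Return the lemma for the first variant code in the chain that has an entry."""
    for code in FALLBACK_CHAINS.get(requested, [requested]):
        if code in lemmas:
            return lemmas[code]
    return None


# One Lexeme with a multi-variant lemma, keyed by variant code.
colour = {"en": "color", "en-GB": "colour"}

print(pick_lemma(colour, "en-GB"))
print(pick_lemma(colour, "en-US"))
```

A request for a variant with no entry and no matching chain yields None, so the caller has to decide what to display in that case.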

2b) if we treat lemmas as multi-variant, should Forms also be multi-variant, or
should they be per-variant? Should the gloss of a Sense be multi-variant? I
currently tend towards "yes" for all of the above.


What do you think?


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
