Re: [Wikidata-tech] Two questions about Lexeme Modeling

2016-11-25 Thread David Cuenca Tudela
> If we want to avoid this complexity, we could just go by prefix. So if the
> language is "de", variants like "de-CH" or "de-DE_old" would be
> considered ok.
> Ordering these alphabetically would put the "main" code (with no suffix)
> first.
> May be ok for a start.

I find this issue potentially controversial, and I think that the community
at large should be involved in this matter to avoid future dissatisfaction
and to promote involvement in the decision-making.

For languages there are regulatory bodies that assign codes, but for
varieties this is not the case, or at least not entirely. Even under en-GB
there are many varieties and dialects:
https://en.wikipedia.org/wiki/List_of_dialects_of_the_English_language#United_Kingdom

In my opinion it would be more appropriate to use standardized language
codes and then specify the dialect with an item, as that provides greater
flexibility. However, as mentioned before, I would prefer that this topic
in particular be discussed with Wiktionarians.


Thanks for moving this forward!

David



On Fri, Nov 25, 2016 at 11:45 AM, Daniel Kinzler <
daniel.kinz...@wikimedia.de> wrote:

> Thank you Denny for having an open mind! And sorry for being a nuisance ;)
>
> I think it's very important to have controversial but constructive
> discussions
> about these things. Data models are very hard to change even slightly once
> people have started to create and use the data. We need to try hard to get
> it as
> right as possible off the bat.
>
> Some remarks inline below.
>
> Am 25.11.2016 um 03:32 schrieb Denny Vrandečić:
> > There is one thing that worries me about the multi-lemma approach, and
> > that is the mention of a discussion about ordering. If possible, I would
> > suggest not to have ordering in every single Lexeme or even Form, but
> > rather to use the following solution:
> >
> > If I understand it correctly, we won't let every Lexeme have every
> > arbitrary language anyway, right? Instead we will, for each language
> > that has variants, have somewhere in the configurations an explicit list
> > of these variants, i.e. say, for English it will be US, British, etc.,
> > for Portuguese Brazilian and Portuguese, etc.
>
> That approach is similar to what we are now doing for sorting Statement
> groups
> on Items. There is a global ordering of properties defined on a wiki page.
> So
> the community can still fight over it, but only in one place :) We can
> re-order
> based on user preference using a Gadget.
>
> For the multi-variant lemmas, we need to declare the Lexeme's language
> separately, in addition to the language code associated with each lemma
> variant.
> It seems like the language will probably be represented as a reference to a
> Wikidata
> Item (that is, a Q-Id). That Item can be associated with an (ordered) list
> of
> matching language codes, via Statements on the Item, or via configuration
> (or,
> like we do for unit conversion, configuration generated from Statements on
> Items).
>
> If we want to avoid this complexity, we could just go by prefix. So if the
> language is "de", variants like "de-CH" or "de-DE_old" would be
> considered ok.
> Ordering these alphabetically would put the "main" code (with no suffix)
> first.
> May be ok for a start.
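The prefix rule and alphabetical ordering described in the quoted paragraph can be sketched in a few lines. This is a hypothetical illustration, not the actual Wikibase implementation; the function names are invented for this sketch.

```python
# Hypothetical sketch of the prefix rule: a variant code is accepted if it
# equals the Lexeme's language code or extends it after a hyphen, and plain
# alphabetical sorting puts the bare "main" code first.

def is_valid_variant(lexeme_lang: str, code: str) -> bool:
    """Accept "de" itself plus suffixed variants like "de-CH" or "de-DE_old"."""
    return code == lexeme_lang or code.startswith(lexeme_lang + "-")

def order_variants(codes: list[str]) -> list[str]:
    """Alphabetical order; the bare code sorts before any suffixed variant."""
    return sorted(codes)

codes = ["de-DE", "de", "de-CH"]
assert all(is_valid_variant("de", c) for c in codes)
assert order_variants(codes) == ["de", "de-CH", "de-DE"]
assert not is_valid_variant("de", "den")  # the prefix must end at the hyphen
```

Note that requiring the hyphen avoids false matches such as "den" for "de".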
>
> I'm not sure yet on what level we want to enforce the restriction on
> language
> codes. We can do it just before saving new data (the "validation" step),
> or we
> could treat it as a community enforced soft constraint. I'm tending
> towards the
> former, though.
>
> > Given that, we can in that very same place also define their ordering
> > and their fallbacks.
>
> Well, all lemmas would fall back on each other, the question is just which
> ones
> should be preferred. Simple heuristic: prefer the shortest language code.
> Or go
> by what MediaWiki does for the UI (which is what we do for Item labels).
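The "prefer the shortest language code" heuristic mentioned above can be sketched as follows. This is an illustrative snippet with invented names, not actual Wikibase code; ties are broken alphabetically here as one possible assumption.

```python
# Hypothetical sketch of the fallback heuristic: among a Lexeme's lemma
# variants, display the one with the shortest language code; break ties
# alphabetically.

def preferred_lemma(lemmas: dict[str, str]) -> str:
    """Pick the lemma whose code is shortest (then alphabetically first)."""
    code = min(lemmas, key=lambda c: (len(c), c))
    return lemmas[code]

assert preferred_lemma({"de-CH": "Strasse", "de": "Straße"}) == "Straße"
assert preferred_lemma({"en-US": "color", "en-GB": "colour"}) == "colour"
```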
>
> > The upside is that it seems that this very same solution could also be
> > used for languages with different scripts, like Serbian, Kazakh, and
> > Uzbek (although it would not cover the problems with Chinese, but that
> > wasn't solved previously either - so the situation is strictly better).
> > (It doesn't really solve all problems - there is a reason why ISO treats
> > language variants and scripts independently - but it improves on the
> > vast majority of the problematic cases).
>
> Yes, it's not the only decision we have to make in this regard, but the
> most
> fundamental one, I think.
>
> One consequence of this is that Forms should probably also allow multiple
> representations/spellings. This is for consistency with the lemma, for code
> re-use, and for compatibility with Lemon.
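The idea of Forms carrying multiple representations, mirroring the multi-variant lemma, could look roughly like this. The class and field names are invented for illustration and do not reflect the actual Wikibase Lexeme data model.

```python
# Hypothetical sketch: a Form holds one representation (spelling) per
# language/variant code, just like a multi-variant lemma.
from dataclasses import dataclass, field

@dataclass
class Form:
    representations: dict[str, str] = field(default_factory=dict)

f = Form()
f.representations["en-GB"] = "colours"
f.representations["en-US"] = "colors"
assert f.representations["en-GB"] == "colours"
```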
>
> > So, given that we drop any local ordering in the UI and API, I think
> > that staying close to Lemon and choosing a TermList seems currently like
> > the most promising approach to me, and I changed my mind.
>
> Knowing that you won't do that without a good reason, I thank you for the
> 

Re: [Wikidata-tech] Two questions about Lexeme Modeling

2016-11-22 Thread David Cuenca Tudela
> There are many many words with multiple spellings, but not many words
> with more than two, and few with more than three [citation needed].

That is not true in languages with a large number of dialects. For instance,
in Catalan there are 5 standard spellings for "carrot", depending on which
dialect you choose, plus some more if you consider local variations:
https://ca.wikipedia.org/wiki/Pastanaga

But that is nothing compared to the 8 spellings of tomato, or more if you
count the local variations:
https://ca.wikipedia.org/wiki/Tom%C3%A0quet

Additionally the same form can have different meanings depending on which
dialect you choose. For instance "pastenaga" means "orange carrot" in
Catalan from Catalonia, and "purple carrot" in Catalan from Valencia.

Which makes me think: how will dialects be handled? Statements?

This is an example of a dialect map:
https://ca.wikipedia.org/wiki/Dialectes_del_catal%C3%A0#Divisi.C3.B3_dialectal

Regards, and thanks for your elaborate answer,
-d



On Mon, Nov 21, 2016 at 5:45 PM, Daniel Kinzler <daniel.kinz...@wikimedia.de
> wrote:

> Hi all!
>
> Sorry for the delay. To keep the conversation in one place, I will reply to
> David, Denny, and Philipp in one mail. It's going to be a bit long,
> sorry...
>
> Am 11.11.2016 um 23:17 schrieb David Cuenca Tudela:
> > Hi there!
> >
> > 1) a possible solution could be to have another category of items
> > ("Gxxx", grammatical rule?) to store grammatical structures, like
> > "Noun + verb + object" or "Noun + reflexive verb", and then linking to
> > that structure with a qualifier of the position that it uses on that
> > structure. Example:
> > "to shit"  "Subject + reflexive verb + reflexive pronoun"
> >  "reflexive verb"
>
> I see no need for a separate entity type, this could be done with a regular
> Item. If we want this to work nicely for display, though, the software
> would
> need to know about some "magic" properties and their meaning. Since
> Wikidata
> provides a stable global vocabulary, it would not be terrible to hard-code
> this.
> But still, it's special case code...
>
> This is pretty similar to Lemon's "Syntactic Frame" that Philipp pointed
> out,
> see below.
>
> > 2) I would prefer statements, as they can be complemented with
> > qualifiers as for why it has a certain spelling (geographical variant,
> > old usage, corruption...).
>
> You can always use a statement for this kind of information, just as we do
> now
> on Wikidata with properties for the surname or official name.
>
> The question is how often the flexibility of a statement is really needed.
> If
> it's not too often, it would be ok to require both (the lemma and the
> statement)
> to be entered separately, as we do now for official name, birth name, etc.
>
> Another question is which (multi-term lemma or secondary
> lemma-in-a-statement)
> is easier to handle by a 3rd party consumer. More about that later.
>
> > It would be nice however if there were some mechanism to have a special
> > kind of property that would use its value as an item alias. And this is
> > something that could benefit normal items in Wikidata too, as most name
> > properties like P1448, P1477 (official name, birth name, etc.) should
> > have their value automatically shown as an alias of the item in all
> > languages, if that were technologically feasible.
>
> Yes, this would be very convenient. But it would also mix levels of content
> (editorial vs. sourced) that are now nicely separated. I'm very tempted,
> but I'm
> not sure it's worth it.
>
> Am 12.11.2016 um 00:08 schrieb Denny Vrandečić:
> > Not only that. "I shit myself" is very different from "Don't shit
> > yourself". It is not just the reflexivity. It might be the whole phrase.
>
> Yes, the boundary to a phrase is not clear cut. But if we need the full
> power of
> modeling as a phrase, we can always do that by creating a separate Lexeme
> for
> the phrase. The question is if that should be the preferred or even the
> only way
> to model the "syntactic frame".
>
> It's typical for a dictionary to have a list of meanings structured like
> this:
>
>   to ask
>   to ask so. sth.
>   to ask so. for sth.
>   to ask so. about sth.
>   to ask so. after sb.
>   to ask so. out
>   ...
>
> It would be nice if we had an easy way to create such an overview. If each
> line
> is modeled as a separate Lexeme, we need to decide how these Lexemes
> should be
> connected to allow such an overview.
>
> I feel