Re: [Wikidata] Preferred rank -- choices for infoboxes, versus SPARQL

Markus Krötzsch Fri, 27 Nov 2015 08:27:06 -0800

On 27.11.2015 17:05, Tobias Schönberg wrote:

@Markus, James:
In my opinion it is better to make the query ask for the most recent
population number. People just need to start using time-qualifiers for
things like census-report numbers.

Unfortunately, this is not sufficient for census number selections,
since the most recent number might be less accurate than another
somewhat-recent number, which is therefore considered "preferred". I
have no idea how to come up with a reasonable SPARQL query to evaluate
this situation.

Similarly, ignoring the instance-of statements that are historic if
other statements may have no times associated whatsoever, and picking
the most recent instance-of statement if all of them have times
associated would require an amount of computation that you really don't
want to encode in SPARQL. Feel free to prove me wrong by posting the
SPARQL query here, but I think it won't be feasible. SPARQL is not a
programming language to implement arbitrarily complex selection rules
in. The current rank-based system, in spite of its necessary
limitations, is in fact highly effective for solving a huge number of
such issues in a pragmatic way. You may need to use the exact data for
many applications (we completely agree there), but ranks will always be
of great use to keep the rest of your query as simple as possible.
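
As a rough illustration of why ranks keep queries simple: the "truthy"
(wdt:) view of a property is derived from ranks alone, with no date
arithmetic. Below is a minimal Python sketch of that rule, assuming a
simplified in-memory statement model. The Statement class, the
best_rank_values helper and the sample values are made up for this
example and are not part of any Wikidata API.

```python
from dataclasses import dataclass


@dataclass
class Statement:
    value: str
    rank: str = "normal"  # one of "preferred", "normal", "deprecated"


def best_rank_values(statements):
    """Return the values a rank-based ("truthy") lookup would see.

    Preferred statements win if any exist; otherwise all normal-rank
    statements are returned. Deprecated statements are never returned.
    """
    preferred = [s.value for s in statements if s.rank == "preferred"]
    if preferred:
        return preferred
    return [s.value for s in statements if s.rank == "normal"]


# Hypothetical population statements for one item.
populations = [
    Statement("older census figure"),
    Statement("more reliable recent figure", rank="preferred"),
    Statement("known-wrong figure", rank="deprecated"),
]
print(best_rank_values(populations))
```

The same rule explains the concern raised in this thread: once one
instance-of value is marked "preferred", the normal-rank values no
longer appear in the truthy view.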


And the other issue is one of standardized vocabulary and that is always
a sourcing problem in my opinion. A query could say "get the
instance-of-statement" that has a supporting source from the Spanish
Geographic Society. Then the infobox would only include standardized
vocabulary by that organization. But I acknowledge that large parts of
the world are not covered by standardized vocabulary organizations.

Yes, it seems we need to let the use of references evolve a little more
until such things will be feasible and lead to good coverage.


If that doesn't solve it we could at least think about language specific
rank-overrides.

Storing ranks per language will not be feasible or desirable. I think
the solutions I gave can go a long way. In the end, any
language-specific way to define the classes you want to display/hide
will do. For example, a SPARQL query for all super classes that have an
article in a given Wikipedia is rather easy (querying for the most
specific such superclasses is another matter of course ...).
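
A sketch of what such a query could look like, wrapped in a small
Python helper that only builds the query string (nothing is sent over
the network). The helper name is made up for this example. The query
assumes the prefixes that the Wikidata query service predefines (wd:,
wdt:, schema:) and the usual sitelink pattern with schema:about and
schema:isPartOf; it does not attempt the "most specific" filtering.

```python
def superclasses_with_article_query(item_qid, wiki_url):
    """Build a SPARQL query for classes of an item that have an article.

    item_qid: Wikidata item ID such as "Q1794" (Frankfurt).
    wiki_url: base URL of the wiki, e.g. "https://es.wikipedia.org/".
    """
    return f"""
SELECT DISTINCT ?class WHERE {{
  # Direct classes of the item and all of their superclasses.
  wd:{item_qid} wdt:P31/wdt:P279* ?class .
  # Keep only classes with a sitelink to the given wiki.
  ?article schema:about ?class ;
           schema:isPartOf <{wiki_url}> .
}}
""".strip()


print(superclasses_with_article_query("Q1794", "https://es.wikipedia.org/"))
```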


Markus

2015-11-27 16:41 GMT+01:00 Markus Krötzsch
<mar...@semantic-mediawiki.org>:

    Hi James,

    I would immediately agree to the following measures to alleviate
    your problem:

    (1) If some instance-of statements are historic (i.e., no longer
    valid), then one should make the current ones "preferred" and leave
    the historic ones "normal", just like for, e.g., population numbers.
    This would get rid of the rather inappropriate "Free imperial city"
    label for Frankfurt.

    (2) If some classes are redundant, they could be removed (e.g., if
    we already have "Big city" we do not need "city"). However,
    community might decide to prefer the direct use of a main class
    (such as "Human"), even if redundant.

    The other issues you mention are more tricky. Especially issues of
    translation/cultural specificity. The most specific classes are not
    always the ones that all languages would want to see, e.g., if the
    concept of the class is not known in that language.

    Possible options for solving your problem:

    * Make a whitelist of classes you want to show at all in the
    template, and default to "city" if none of them occurs.
    * Make a blacklist of classes you want to hide.
    * Instead of blacklist or whitelist, show only classes that have a
    Wikipedia page in your language; default to "city" if there are none.
    * Try to generalise overly specific classes (change "big city" to
    "city" etc.). I don't know if there is a good programmatic approach
    for this, or if you would have to make a substitution list or
    something, which would not be very maintainable.
    * Do not use instance-of information like this in the infobox. It
    might sound radical, but I am not sure if "instance of" is really
    working very well for labelling things in the way you expect.
    Instance-of can refer to many orthogonal properties of an object, in
    essentially random order, while a label should probably focus on
    certain aspects only.
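
The whitelist and blacklist options above could be sketched roughly as
follows in Python. The infobox_label function and its parameters are
hypothetical names for this illustration, not an existing template or
module; the "has a page in your language" option amounts to building
the whitelist from the classes that have an article in that wiki.

```python
def infobox_label(classes, whitelist=None, blacklist=(), default="city"):
    """Pick which class labels to show in an infobox subtitle.

    classes:   class labels taken from the item's instance-of values.
    whitelist: if given, only these classes may be shown.
    blacklist: classes that are always hidden.
    default:   label used when nothing else is left to show.
    """
    shown = [c for c in classes if c not in blacklist]
    if whitelist is not None:
        shown = [c for c in shown if c in whitelist]
    return shown or [default]


classes = ["free imperial city", "big city"]
print(infobox_label(classes, blacklist={"free imperial city"}))
print(infobox_label(classes, whitelist={"metropolis"}))
```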

    For obvious reasons, ranks of statements cannot be used to record
    language-specific preferences.

    Cheers,

    Markus


    On 27.11.2015 15:58, James Heald wrote:

        Some items have quite a lot of "instance of" statements,
        connecting them
        to quite a few different classes.

        For example, Frankfurt is currently an instance of seven
        different classes,
        https://www.wikidata.org/wiki/Q1794

        and Glasgow is currently an instance of five different classes:
        https://www.wikidata.org/wiki/Q4093

        This can produce quite a pile-up of descriptions in the
        description/subtitle section of an infobox -- for example, as on the
        Spanish page for Frankfurt at
        https://es.wikipedia.org/wiki/Fr%C3%A1ncfort_del_Meno
        in the section between the infobox title and the picture.


        Question:

        Is it an appropriate use of ranking, to choose a few of the
        values to
        display, and set those values to be "preferred rank" ?

        It would be useful to have wider input as to whether it is a
        good thing
        for this to be done widely.

        Discussions are open at
        
https://www.wikidata.org/wiki/Wikidata:Project_chat#Preferred_and_normal_rank

        and
        
https://www.wikidata.org/wiki/Wikidata:Bistro#Rang_pr.C3.A9f.C3.A9r.C3.A9

        -- but these have so far been inconclusive, and have got
        slightly taken
        over by questions such as

        * how well terms really do map from one language to another --
        near-equivalences that may be near enough for sitelinks may be
        jarring
        or insufficient when presented boldly up-front in an infobox.

        (For example, the French translation "ville" is rather
        unspecific, and
        perhaps inadequate in what it conveys, compared to "city" in
        English or
        "ciudad" in Spanish; "town" in English (which might have over
        100,000
        inhabitants) doesn't necessarily match "bourg" in French or
        "Kleinstadt"
        in German).

        * whether different-language wikis may seek different degrees of
        generalisation or specificity in such sub-title areas, depending
        on how
        "close" the subject is to that wiki.

        (For readers in some languages, some fine distinctions may be highly
        relevant and familiar, whereas for other language groups that
        level of
        detail may be undesirably obscure).


        There is also the question of the effect of promoting some values to
        "preferred rank" for the visibility of other values in SPARQL -- in
        particular when queries are written assuming they can get
        away with
        using just the simple "truthy" wdt:... form of properties.

        However, making eg the value "city" preferred for Glasgow means
        that it
        will no longer be returned in searches for its other values, if
        these
        have been written using "wdt:..." -- so it will now be missed in a
        simple-level query for "council areas", the current top-level
        administrative subdivisions of Scotland, or for historically-based
        "registration counties" -- and this problem will become more
        pronounced
        if the practice becomes more widespread of making some values
        "preferred" (and so other values invisible, at least for queries
        using
        wdt:...).

        From a SPARQL point of view, what would actually be very
        helpful would
        be to add a (new) fourth rank -- "misleading without qualifier", below
        "normal" but above "deprecated" -- for statements that *are*
        true (with
        the qualifiers), but could be misleading without them
        * for example, for a town that was the county town of a shire
        once, but
        hasn't been for two centuries
        * or for an administrative area that is partly located in one
        higher-level division, and partly in another -- this is very
        valuable
        information to be able to note, but it's important to be able to
        exclude
        it from being all included in a recursive search for the places
        in one
        (but not the other) of that higher-level division.

        The statements shouldn't be marked "deprecated", because they
        are true
        (unlike a widely-given but incorrect date of birth, for
        example).  At
        the moment one can sort of work round the issue, if one can find
        another
        statement to make "preferred", so that the qualified statement
        becomes
        invisible to a simple search without qualifiers.  However, if
        "preferred" status is going to be used just to select things to
        show in
        infoboxes, it becomes very desirable that "wdt:..." searches should
        retrieve things at normal rank as well -- creating a need for a
        new rank
        for statements which are true, but misleading if read without
        qualifiers.


        What *is* needed though, is a view on whether trying to tailor
        what is
        shown in infoboxes is an appropriate reason to alter statement
        rankings.

        It would be good to get a view on this.

        The Spanish guys who started doing this have temporarily put further
        rank-changes on hold, for the issue to be discussed; but so far what
        they have done has only just scratched the surface of what could
        be done
        -- there are still a lot more cases of multiple values they
        would like
        to tidy.

        So: is this the kind of thing that "preferred rank" is envisaged
        for ?

        Or, should some statements not be marked as less preferred than
        others,
        if this is the only reason ?


             --  James.


        _______________________________________________
        Wikidata mailing list
        Wikidata@lists.wikimedia.org
        https://lists.wikimedia.org/mailman/listinfo/wikidata


