Re: [Wikidata-l] Data values

jmcclure Fri, 21 Dec 2012 09:48:46 -0800

 

(If I knew Denny's private email address, I'd send this there.)



"Martynas, there is no mention here of XSD etc. because it is not
relevant on this level of discussion. For exporting the data we will
obviously use XSD datatypes. This is so obvious that I didn't think it
needed to be explicitly stated." 

Maybe you don't realize that snarky comments such as the last sentence
are infinitely tiresome. It certainly demonstrates your lack of
understanding of what Martynas is saying: build on others' work to the
extent that you can and, if you can't, explain why not.

On 21.12.2012 09:14, Denny Vrandečić wrote: 

> Hi all,
>
> wow! Thanks for all the input. I read it all through, and am
> currently trying to digest it into a new draft of the data model for
> the discussed data values. I will try to address some questions here.
> Please be kind if I refer to the wrong person at one place or the
> other.
> 
> Whenever I refer to the "current model", I mean the version as it was
> during this discussion
> <http://meta.wikimedia.org/w/index.php?title=Wikidata/Development/Representing_values&oldid=4859586 [2]>
> 
> The term "updated model" refers to the new one, which is not
> published yet. I hope I can do that soon.
> 
> == General comments ==

> 
> I want to remind everyone of the Wikidata requirements:
> <http://meta.wikimedia.org/wiki/Wikidata/Notes/Requirements [3]>
>
> Here especially:
> * The expressiveness of Wikidata will be limited. There will always
>   be examples of knowledge that Wikidata will not be able to convey.
>   We hope that this expressiveness can increase over time.
> * The first goal of Wikidata is to serve actual use cases in
>   Wikipedia, not to enable some form of hypothetical perfection in
>   knowledge representation.
> * Wikidata has to balance ease of use and expressiveness of
>   statements. The user interface should not get complicated to merely
>   cover a few exceptional edge cases.
> * What is an exceptional case, and what is not, will be defined by
>   how often they appear in Wikipedia. Instead of anecdotal evidence
>   or hypothetical examples we will analyse Wikipedia and see how
>   frequent specific cases are.
> 
> In general this means that we cannot express everything that is
> expressible. A statement should not be intended to reflect the source
> as closely as possible, but rather to be *supported* by the source.
> I.e. if the source says "He died during the early days of 1876" this
> would also support a statement like "died in - 19th century". It does
> not have to be more exact than that.
> 
> Martynas, there is no mention here of XSD etc. because it is not
> relevant on this level of discussion. For exporting the data we will
> obviously use XSD datatypes. This is so obvious that I didn't think it
> needed to be explicitly stated.
> 
> Tom, thanks for the links to EDTF and the Freebase work, this was
> certainly very enlightening.
> 
> Friedrich, the term "query answering" simply means the ability to
> answer queries against the database in Phase 3, e.g. the list of
> cities located in Ghana with a population over 25,000 ordered by
> population.
> 
> A query system that deals well with intervals -- I would need a
> pointer for that. For now I have always assumed we would use a single
> value internally to answer such queries. If the value is 90+-20 then
> the query >100? would not contain that result. Sucks, but I don't
> know of any better system.
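
To make the 90+-20 case concrete, here is a minimal Python sketch of the two ways such a query could be answered. It is only an illustration: the `Quantity` class and both function names are hypothetical and not part of the Wikidata data model, and the interval variant is an assumption about what "dealing well with intervals" might mean.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Quantity:
    """Hypothetical value with a symmetric uncertainty (value +- uncertainty)."""
    value: Decimal
    uncertainty: Decimal

    @property
    def lower(self) -> Decimal:
        return self.value - self.uncertainty

    @property
    def upper(self) -> Decimal:
        return self.value + self.uncertainty


def greater_than_point(q: Quantity, threshold: Decimal) -> bool:
    # Single-value semantics: only the central value is compared.
    return q.value > threshold


def greater_than_possible(q: Quantity, threshold: Decimal) -> bool:
    # Interval semantics (an assumption): match if any part of the
    # interval lies above the threshold.
    return q.upper > threshold


q = Quantity(Decimal("90"), Decimal("20"))
print(greater_than_point(q, Decimal("100")))     # False
print(greater_than_possible(q, Decimal("100")))  # True
```

The single-value check drops the 90+-20 result for the query >100, as described in the quoted text; the interval check would keep it, at the cost of returning values that may not actually exceed 100.
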
> 
> We do not rely on floats anywhere (besides in internal
> representations), but always use decimals. Floats have some inherent
> problems in representing some numbers that could be interesting for
> us.
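
For readers who have not run into the float problem, a short Python illustration using only the standard library. The specific numbers are just an example and do not come from the discussion.

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum drifts.
float_sum = 0.1 + 0.2
print(float_sum == 0.3)  # False

# Decimals built from strings keep exactly the digits that were typed.
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))  # True
```
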
> 
> == Time == 
> 
> Marco suggested marking some parts of a date as N/A. This is
> partially the idea of the "precision" attribute in the current data.
> Anything below the precision would be N/A. It would not be possible
> to N/A the year when the month or day is known though, as Friedrich
> suggested.
> 
> Friedrich also suggested to use a value like April-July 1567 for
> uncertain time instead of the current precision model. I prefer his
> suggestion to the current one and will include that in the updated
> model.
> 
> The accuracy though has to be in the unit given by the precision; we
> cannot just take seconds, since there is no well-defined number of
> seconds in a month or a year, or, almost anything, actually.
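
The point about seconds is easy to check with Python's standard `calendar` module. The helper `seconds_in_month` is a hypothetical name used only for this illustration, and it ignores leap seconds.

```python
import calendar


def seconds_in_month(year: int, month: int) -> int:
    # monthrange returns (weekday of the first day, number of days in the month).
    days = calendar.monthrange(year, month)[1]
    return days * 24 * 60 * 60  # leap seconds are ignored here


print(seconds_in_month(2012, 2))  # 2505600 (29 days)
print(seconds_in_month(2013, 2))  # 2419200 (28 days)
print(seconds_in_month(2013, 1))  # 2678400 (31 days)
```

An uncertainty of "one month" therefore has no single length in seconds, which is why it has to be stated in the unit of the precision.
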
> 
> Note though that the intervals that Sven mentioned -- useful for e.g.
> reigns or office periods -- are different beasts and should have
> uncertainty entries both for the start and end date. We have intervals
> in the data model, and plan to implement them later -- it is just that
> they are not such a high priority (dates appear 2.5 million times in
> infoboxes, intervals only 80,000 times).
> 
> I am completely unsure what to do with a value like "about 1850" if
> not to interpret it as something like 1850 +- 50, but Sven seems to
> dislike that.
> 
> == Location == 
> 
> After the discussion, I decided to drop altitude / elevation from the
> Geolocation. It can still be expressed through a property, and have
> all the flexibility of a normal property (including qualifiers etc.)
> 
> In a Geolocation, neither the lat nor the long is optional (sorry
> Nikola). The Geolocation as a whole can be optional, though (i.e.
> unknown), but not only one of them.
> 
> For the geolocation's uncertainty I would like to use the same
> uncertainty model as for Quantity values and now for time. I know that
> "meters" have been suggested instead of degrees, but that would be
> kind of ugh considering that the biggest reason why we need the
> uncertainty is for converting units, in this case from decimals to
> degree-minute-seconds.
> 
> == Quantity values == 
> 
> Sorry to disagree with Daniel here, but we will definitely store a
> quantity value in the unit that the editor used for input. We will
> then internally normalize it for indexing etc., but the editor won't
> be bothered with that as long as they do not ask for a conversion.
> Storing it with the original unit is important for a number of
> reasons, most of which Gregor already alluded to.
> 
> I very much like Gregor's suggestion: rename the lower uncertainty and
> upper uncertainty to something with less semantic baggage. What about
> upper and lower bound? Or just upper and lower? And then leave the
> interpretation to others.
>
> Gregor, an infinitely precise number (the number of apostles, e.g.)
> would be handled trivially by +- 0.
> 
> Also I am taking the hint from Avenue and others and dropping
> confidence. I don't think it is useful to have it so deeply embedded
> in the data model; it should properly be handled through qualifiers.
> 
> Regarding the height of the Eiffel tower: 324 m +- 1m is exactly what
> I would like to see here if the source states 324 meters.
> I know the source doesn't say +-1m, but this is certainly supported by
> the source. Think about why we need this +-1m: it is simply so we can
> give a useful transformation into feet. Otherwise we cannot convert
> units.
> The +-1m would not be displayed usually.
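
Here is a minimal Python sketch of how the +-1m could drive the conversion into feet. The rounding rule (round to the decimal place of the uncertainty's leading digit) is my assumption for illustration, not something stated in the proposal, and the function names are hypothetical.

```python
from decimal import Decimal

# Exact by definition of the international foot.
METERS_PER_FOOT = Decimal("0.3048")


def meters_to_feet(value_m: Decimal, uncertainty_m: Decimal):
    """Convert a length and its uncertainty from meters to feet."""
    return value_m / METERS_PER_FOOT, uncertainty_m / METERS_PER_FOOT


def round_to_uncertainty(value: Decimal, uncertainty: Decimal) -> Decimal:
    """Assumed display rule: keep digits down to the uncertainty's leading digit."""
    if uncertainty == 0:
        return value  # exact number, nothing to round away
    # adjusted() is the exponent of the most significant digit.
    quantum = Decimal(1).scaleb(uncertainty.adjusted())
    return value.quantize(quantum)


value_ft, uncertainty_ft = meters_to_feet(Decimal("324"), Decimal("1"))
print(round_to_uncertainty(value_ft, uncertainty_ft))  # 1063
```

Without the uncertainty there is no principled way to choose between 1063 ft and 1062.992 ft for display; with +-1m (about +-3.28 ft) the whole-foot figure is the natural choice.
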
>

> == Units == 
> 
> I sense consensus that we should allow declaration of units in the
> wiki, and not have them hardcoded in the software. Having discussed
> the various options and in light of the discussion here, the current
> suggestion would be to create a page for every quantity unit including
> the appropriate factors (for linear translations). This is similar to
> the way Freebase does it, as sent around by Tom, and what John McClure
> suggested.
> 
> Then on a given property, the property points to a quantity unit and
> furthermore lists the "usual units" for the given property (pointing
> to the given items), which are used for display.
> 
> Internally, for indexing, sorting, and query answering, we would
> always transform the input to the quantity unit so that values are
> comparable. But this is usually neither exposed nor a useful number
> (e.g. it might have too many significant digits etc.)
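
A minimal Python sketch of that normalization, assuming each unit page carries one linear factor to a base unit. The `LENGTH_FACTORS` table and the `normalize` function are hypothetical stand-ins for whatever the wiki pages would actually hold.

```python
from decimal import Decimal

# Hypothetical stand-in for the per-unit wiki pages: each unit of the
# quantity "length" lists a linear factor to the base unit (meter).
LENGTH_FACTORS = {
    "meter": Decimal("1"),
    "kilometer": Decimal("1000"),
    "foot": Decimal("0.3048"),
    "mile": Decimal("1609.344"),
}


def normalize(value: Decimal, unit: str) -> Decimal:
    """Return the value in the base unit, for indexing and sorting only."""
    return value * LENGTH_FACTORS[unit]


# Values stay stored as entered; the normalized number is only a sort key.
entered = [
    (Decimal("2"), "mile"),
    (Decimal("3"), "kilometer"),
    (Decimal("5000"), "foot"),
]
ordered = sorted(entered, key=lambda pair: normalize(*pair))
print([unit for _, unit in ordered])  # ['foot', 'kilometer', 'mile']
```

A unit with no known factor would simply have no entry in such a table, so it could be stored and displayed but not compared with the others.
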
>

> This would allow the use of historic units like Li or historic miles
> even though we do not know how to translate them to other units (but
> not by the same property).
> 
> This would also allow for other units, like Avenue has pointed out.
> Those are important.
> 
> Nikola, we will not have special handling for money for now. This
> would require a whole different spec I am afraid. Currencies appear
> 200,000 times in Wikipedia -- that is often, but not often enough to
> be high priority.
> 
> I hope that I managed to digest the whole discussion and bring it
> together.
>
> Cheers,
> Denny
> 
> _______________________________________________
> Wikidata-l mailing list
> Wikidata-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata-l [1]




Links:
------
[1]
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
[2]
http://meta.wikimedia.org/w/index.php?title=Wikidata/Development/Representing_values&oldid=4859586
[3]
http://meta.wikimedia.org/wiki/Wikidata/Notes/Requirements

