Re: [Wikidata-l] Data values

Daniel Kinzler Wed, 19 Dec 2012 03:14:31 -0800

On 19.12.2012 08:53, Gregor Hagedorn wrote:
>> Displaying the numbers is another question. There I have to agree that it
>> always makes sense to also store a typical used unit for that type of data.
> 
> I agree. What I propose is that the user interface supports entering
> and proofreading "10.6 nm" as "10.6" plus "n" (= nano) plus "meter".


Yes, absolutely.
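
A minimal sketch of how such input could be split, assuming tiny hypothetical lookup tables (PREFIXES, UNITS) and a hypothetical helper parse_quantity; a real interface would need far more units and much more forgiving parsing:

```python
from decimal import Decimal

# Hypothetical lookup tables, for illustration only.
PREFIXES = {"n": "nano", "u": "micro", "m": "milli", "k": "kilo"}
UNITS = {"m": "meter", "g": "gram", "l": "liter"}

def parse_quantity(text):
    """Split input such as '10.6 nm' into (amount, prefix name, unit name)."""
    amount_text, unit_text = text.split()
    amount = Decimal(amount_text)  # Decimal keeps the digits exactly as typed
    # A bare unit symbol wins over a prefix reading ('m' is meter, not milli).
    if unit_text in UNITS:
        return amount, None, UNITS[unit_text]
    prefix, unit = unit_text[0], unit_text[1:]
    if prefix in PREFIXES and unit in UNITS:
        return amount, PREFIXES[prefix], UNITS[unit]
    raise ValueError(f"unrecognised unit: {unit_text!r}")
```

Here parse_quantity("10.6 nm") yields the amount 10.6 together with "nano" and "meter", which the user can then proofread as three separate parts.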

> How the value is stored in the data property, whether as 10.6 floating
> point or as 1.6e-8 is a second issue -- the latter is probably
> preferable. 

I think neither is sufficient: we need a representation that allows for
arbitrary (or at least very great) precision, and can still be indexed and
compared natively by (different!) database systems. Fixed length strings can
easily do that, if they are long enough. That's pretty inefficient, though.
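
To illustrate the fixed-length string idea, here is a rough sketch. The widths (40 digits on each side) and the helper name to_sortable are assumptions picked for the example, and negative values are left out entirely:

```python
from decimal import Decimal

INT_DIGITS = 40   # assumed width before the decimal point
FRAC_DIGITS = 40  # assumed width after the decimal point

def to_sortable(value):
    """Encode a non-negative Decimal as a fixed-length, zero-padded string."""
    if value.is_signed():
        raise ValueError("this sketch only covers non-negative values")
    # Fixed-point rendering with exactly FRAC_DIGITS digits after the point.
    int_part, frac_part = format(value, f".{FRAC_DIGITS}f").split(".")
    if len(int_part) > INT_DIGITS:
        raise OverflowError("value does not fit the fixed width")
    return int_part.zfill(INT_DIGITS) + "." + frac_part
```

Because every encoded value has the same length and the same position for the decimal point, plain string comparison orders the strings the same way as the numbers, so any database that can index strings can index these. The cost is 81 characters per value in this configuration.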

IEEE floats work natively, but don't guarantee enough precision (well, maybe
128-bit floats come close?). The SQL "decimal" type might be sufficient: in
MySQL, it allows up to 65 digits in total, at most 30 of them after the decimal
point. But that's still not enough to express both the extent of the universe
and the Planck length in meters.
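
A small illustration of the precision concern, using rough figures that are my own assumptions for the example (about 8.8e26 m for the extent of the observable universe, about 1.6e-35 m for the Planck length). It compares a 64-bit float with Python's arbitrary-precision Decimal:

```python
from decimal import Decimal, getcontext

getcontext().prec = 70  # assumed working precision for this illustration

# 8.8e26 plus 1.6e-35, written out digit by digit (63 digits in total).
text = "88" + "0" * 25 + "." + "0" * 34 + "16"

# The float silently drops the small part: it equals plain 8.8e26.
float_lost_detail = float(text) == 8.8e26

# The Decimal keeps every digit, so the small part can be recovered.
remainder = Decimal(text) - Decimal("8.8e26")
```

With a 64-bit float only about 16 significant digits survive, so the two magnitudes cannot coexist in one value; the Decimal keeps all 63 digits, at the price of not being a native, fixed-size database type.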

> In addition to a storage option of the desired unit prefix (this may
> be considered an original-prefix, since naturally re-users may wish to
> reformat this).

I see no point in storing the unit used for input.

> it is probably necessary to store the number of
> significant decimals.

That's how Denny proposed to calculate the default accuracy. If the accuracy is
given by a complex model (e.g. a gamma distribution), then it might be handy to
have a simple value that tells us the significant digits.

Hm... perhaps it's best to always express accuracy as "+/-n", and allow for more
detailed information (standard deviation, whatever) as *additional* information
about the accuracy (could be modelled as a qualifier internally).
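
One possible reading of "default accuracy from the significant digits", sketched under the assumption that the default is half a unit in the last digit typed. The helper name default_accuracy is hypothetical, and this is not necessarily the rule Denny had in mind:

```python
from decimal import Decimal

def default_accuracy(text):
    """Half a unit in the last digit typed, e.g. '10.20' -> 0.005."""
    # The exponent of the parsed Decimal records how many decimals were typed.
    exponent = Decimal(text).as_tuple().exponent
    return Decimal(5).scaleb(exponent - 1)
```

Under this assumption "10.20" gets +/-0.005 while "10.2" gets +/-0.05, so the trailing zero the user typed survives as accuracy rather than as formatting.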

> I believe in the user interface this need not
> be any visible setting, simply the number of digits can be preserved.
> Without these it is impossible to store and reproduce information like
> "10.20 nm", it would be returned as 1.02 × 10^-8 m.

No, it would return using whatever system of measurement the user has selected
in their preferences.

> Complex heuristics
> may "guess" when to use the scientific SI prefixes instead. The
> trailing zero cannot be reproduced however when completely relying on
> IEEE floating-point.

We'll need heuristics to pick the correct secondary unit (e.g. nm or km). The
general rule could be to pick a unit so that the actual value is between 1 and
10, with some additional rules for dealing with cultural peculiarities (the
decimeter is rarely used, while the hectoliter is pretty common; the decagram
is commonly used in Austria only, etc.).
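
A rough sketch of such a heuristic, assuming a hypothetical per-locale allow-list of prefixes (PREFIXES) and a hypothetical helper pick_prefix. It picks the largest allowed prefix that keeps the displayed value at 1 or above:

```python
from decimal import Decimal

# Hypothetical per-locale allow-list: exponent of ten -> prefix symbol.
# Cultural rules would be expressed by adding or removing entries.
PREFIXES = {-9: "n", -6: "µ", -3: "m", -2: "c", 0: "", 3: "k"}

def pick_prefix(value_in_base_unit, prefixes=PREFIXES):
    """Return (scaled value, prefix symbol) for display."""
    magnitude = value_in_base_unit.adjusted()  # exponent of the leading digit
    usable = [e for e in prefixes if e <= magnitude]
    # Fall back to the smallest prefix if the value is tinier than all of them.
    exponent = max(usable) if usable else min(prefixes)
    return value_in_base_unit.scaleb(-exponent), prefixes[exponent]
```

For 1.06e-8 (meters) this gives 10.6 with "n". Note that the result only lands between 1 and 10 where the allow-list is dense enough; with the usual steps of 1000 it lands between 1 and 1000.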

Note that for rendering of values in infoboxes, the desired unit and precision
can always be given explicitly.

Note "precision" vs "accuracy" here: the precision controls how many digits are
shown, while the accuracy indicates how exact our knowledge is. The precision
can be derived from the accuracy and vice versa, using appropriate heuristics.
But they are not the same. IMHO, the accuracy should always be stored with the
value, the precision never.
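
As an example of deriving the precision from the accuracy, one simple heuristic (an assumption on my part, other rules are just as defensible) is to show digits down to the place of the accuracy's leading digit. The helper names display_digits and render are hypothetical:

```python
from decimal import Decimal

def display_digits(accuracy):
    """Number of decimal places worth showing for a given +/- accuracy."""
    # adjusted() is the exponent of the leading digit, e.g. -2 for 0.05.
    return max(0, -accuracy.adjusted())

def render(value, accuracy):
    """Format the value with a precision derived from its accuracy."""
    return format(value, f".{display_digits(accuracy)}f")
```

With this rule a stored value of 10.2 with an accuracy of +/-0.05 is rendered as "10.20", so only the accuracy needs to be stored and the precision is computed at display time.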

-- daniel

-- 
Daniel Kinzler, Softwarearchitekt
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.


_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
