> So, please suggest terms to use for at least these two things:
>
> 1) value certainty (ideally, not using "digits", but something that is
> independent of unit and rendering)

Here we want to express that the true value lies, with a certain
probability, within a given interval, something like: "2.3
+/- 0.2 µm"

I am not too sure here myself. Different terms exist depending on
whether you talk about the inherent measurement error of a single
individual with a single true value, or about statistical measures or
estimates.

Marco gives yet another example: "We want to specify the "limits of
(possible) variation" of a value, which would be Engineering
tolerance. E.g. the value of electrical resistances, capacitors, etc.
are measured in Ω ± % or F ± %. We could also either use/allow/display
absolute or relative values." -- In this case, it is actually not an
uncertainty of the actual sample of resistors, but a design
specification, i.e. the specification that resistors must (all, or
only 95%?) lie within _at least_ these limits.

So what to do here?

Shall we list the different use cases of a value plus/minus other values?
* measurement-method limited precision range of single measurements
(e.g. small structures in a light microscope, limited by the resolution
capability of blue light, approx. 0.2 µm)
* measurement-method limited accuracy range (or accuracy plus precision)
* Confidence interval for mean (or other statistical parameters: mode,
variance, etc.) of the population as estimated based on a sample
* one of potentially several percentiles (incl. +/- s.d.) measuring
spread, but giving no information about the probability that the true
mean is between these values
* engineering design specifications that a given (unknown) fraction of
individuals must be within these limits
I believe for the moment you don't want to go into certainty in the
sense that a number is only an estimate of an unknown true value.


All these different concepts rightly have different names. There can be:
* precision +/- 0.2
* accuracy +/- 0.2
* tolerance +/- 0.2
* error margin +/- 0.2
* +/- 1 or 2 s.d. +/- 0.2
* 95% confidence interval (CI) +/- 0.2
* 10th to 90th percentile +/- 0.2
* uncertainty (of what?) +/- 0.2

(ASIDE: +/- 2 s.d. defines roughly a 95% probability that the next
value from a random sample is in the interval; the 95% CI means that
the true value of the mean is in that interval. These are completely
different things -- for the same measurements you can validly report
100 +/- 50 for the first and 100 +/- 0.001 for the second. That is,
with probability 95% the next randomly sampled measurement will be
between 50 and 150, and with probability 95% the true mean is known
to lie between 99.999 and 100.001. Semantics matter, not only the
"pattern" of plus/minus a value.)
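For illustration, a minimal sketch (plain Python, my own variable names
and a made-up sample) that computes both intervals from the same
measurements:

    # Contrast the two intervals from the aside: +/- 2 s.d. describes the
    # spread of individual measurements, the 95% CI describes where the
    # true mean likely lies.
    import math
    import statistics

    measurements = [100.2, 99.8, 100.5, 99.5, 100.1, 99.9]  # hypothetical sample
    mean = statistics.mean(measurements)
    sd = statistics.stdev(measurements)   # sample standard deviation
    n = len(measurements)

    # roughly 95% of individual values fall within +/- 2 s.d. of the mean
    spread_interval = (mean - 2 * sd, mean + 2 * sd)

    # the 95% CI for the mean shrinks with sqrt(n); 1.96 is the normal approximation
    half_width = 1.96 * sd / math.sqrt(n)
    ci_for_mean = (mean - half_width, mean + half_width)

    print("spread (+/- 2 s.d.):", spread_interval)
    print("95% CI of the mean :", ci_for_mean)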


Because of the widely varying use cases listed above, I believe we
need very neutral labels for the plus/minus values if the data type
shall simply provide two "variables" in a generic sense, the true
semantics of which are then provided by qualifier information.

I could think of something like:
* lower range (lowerRange) and upper range (upperRange)
* lower/upper interval value/endpoint
but I don't like this very much because it would force people to
abandon the plus/minus notation and calculate actual endpoint values.

Better may be something like:
* upwardsAbsolute
* downwardsAbsolute
* upwardsPercent
* downwardsPercent
or
* plusValueAbsolute
* minusValueAbsolute
* plusValuePercent
* minusValuePercent
as neutral terms - but I would be glad if someone comes up with other
neutral terms.
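
To make the idea concrete, a rough sketch (plain Python; all names such
as QuantityValue, plus_value_absolute and interval_semantics are invented
here for illustration, not an actual Wikidata data model) of generic
plus/minus slots whose meaning is supplied by a separate qualifier:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class QuantityValue:
        amount: float
        unit: str
        # generic plus/minus slots; their meaning is given separately
        plus_value_absolute: Optional[float] = None
        minus_value_absolute: Optional[float] = None
        plus_value_percent: Optional[float] = None
        minus_value_percent: Optional[float] = None
        # qualifier naming which of the use cases above is meant, e.g.
        # "measurement precision", "95% CI of mean", "engineering tolerance"
        interval_semantics: Optional[str] = None

    resistor = QuantityValue(amount=4700.0, unit="Ohm",
                             plus_value_percent=5.0, minus_value_percent=5.0,
                             interval_semantics="engineering tolerance")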


However, I hope we start realizing that all of us seem to look at this
primarily from only one of the use cases listed above (me included, I
usually have cases with variance spread or CI of mean). We should stop
using terms that are specific to one case but not the others.
The assumption "these things are all more or less the same" is not
true. A confidence interval is neither a manufacturing tolerance nor a
measurement precision. And precision is not accuracy, etc.





> 2) output exactness (here, the number of digits is actually what we want to 
> talk
> about)

xsd:totalDigits or Wikipedia: significantDigits or significantFigures

that is one way to express value exactness, albeit a coarse one.
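
To make "significant digits" concrete, a small helper (plain Python, my
own function, not part of any schema) that rounds a value to a given
number of significant figures, independent of magnitude:

    from decimal import Decimal

    def round_to_significant_figures(value: float, figures: int) -> float:
        """Round value to the given number of significant figures."""
        if value == 0:
            return 0.0
        d = Decimal(str(value))
        # adjusted() gives the exponent of the most significant digit,
        # so this keeps exactly `figures` digits regardless of magnitude
        shift = figures - d.adjusted() - 1
        return float(round(value, shift))

    print(round_to_significant_figures(0.00123456, 3))  # 0.00123
    print(round_to_significant_figures(123456.0, 3))    # 123000.0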

Marco writes: "Everywhere in the realm of software development
precision is used for this. Therefore also here the suggestion of
precision was not that bad."

-> In software development, the term is about the precision of the
numeric data type, i.e. the precision of the storage mechanism. The
term precision is correctly applied here. However, we talk about the
actually significant digits of a measurement, which are part of the
potential information on precision and accuracy of the value. The
measured value with e.g. 6 digits may be stored in a data type which
has a precision of 16 digits. I think applying "precision" to
significant digits produces a fundamental misunderstanding of what
precision is; see the Wikipedia topic on precision and accuracy.
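
A tiny sketch of that distinction (plain Python, assumed numbers): the
storage type's precision is a property of the float, not of the
measurement:

    import sys

    measured = 2.30000          # a value significant to at most 6 digits
    stored = float(measured)

    print(sys.float_info.dig)   # 15 -- decimal digits the storage type reliably preserves
    print(f"{stored:.17g}")     # 2.2999999999999998 -- storage-level digits, not significant digits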


Gregor
