On 29/05/14 13:53, Thomas Douillard wrote:
hehe, maybe some kind of inference could lead to a good heuristic to
suggest properties and values in the entity suggester. As inferred facts
naturally become "softer" and "softer" through the combination of
uncertainties, this could also provide a natural limit for inference: we
fix a probability threshold below which we don't add a fuzzy fact to the
set of facts.
Maybe we could fix a heuristic starting fuzziness or probability score
based on the claim's status: "1 sourced claim" -> high score; a disputed
claim -> lower score; and so on, taking ranks into account.
Sorry, I have to expand on this a bit ...
My main point was that there are many fuzzy logics (depending on the
t-norm you chose) and many probabilistic logics (depending on the
stochastic assumptions you make). The meaning of a score crucially
depends on which logic you are in. Moreover, at least in fuzzy logic,
the scores are only relevant in comparison to other scores (there is no
absolute meaning to "0.3") -- therefore you need to ensure that the
scores are assigned in a globally consistent way (0.3 in Wikidata would
have to mean exactly the same wherever it is used).
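To make the t-norm point concrete, here is a small sketch (purely
illustrative, nothing Wikidata actually does): the "conjunction" of the
same two fuzzy degrees comes out differently under the three standard
t-norms, so a raw score has no meaning until you have fixed the logic.

```python
# Combining two fuzzy truth degrees under three standard t-norms.
# The same inputs yield three different "combined" scores.

def goedel(a, b):
    """Goedel (minimum) t-norm."""
    return min(a, b)

def product(a, b):
    """Product t-norm."""
    return a * b

def lukasiewicz(a, b):
    """Lukasiewicz t-norm."""
    return max(0.0, a + b - 1.0)

a, b = 0.6, 0.7
print(goedel(a, b))                  # 0.6
print(round(product(a, b), 2))       # 0.42
print(round(lukasiewicz(a, b), 2))   # 0.3
```

Three logics, three answers from the same inputs -- which is why scores
assigned under different (or unstated) t-norms cannot be compared.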
This makes it extremely hard to implement such an approach in practice
in a large, distributed knowledge base like ours. What's more, you
cannot find these scores in books or newspapers, so you somehow have to
make them up in another way. You suggested using this for statements
that are not generally accepted, but how do you measure "how disputed" a
statement is? If two thirds of the references are for it and the rest
are against it, do you assign 0.66 as a score? It's very tricky.
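The reference-counting idea above can be written down in two lines,
which is exactly what makes it look deceptively simple (a hypothetical
sketch; the function name and scheme are made up for illustration):

```python
# Naive "support" score from reference counts: 2 of 3 references
# support the statement -> 2/3. This silently assumes every reference
# is equally reliable and independent of the others -- the assumption
# that is hard to justify in an open knowledge base.

def support_score(supporting, opposing):
    total = supporting + opposing
    return supporting / total if total else None

print(round(support_score(2, 1), 2))  # 0.67
```

The arithmetic is trivial; the problem is that nothing grounds the
resulting number in any fixed logic or probability model.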
Fuzzy logic has its main use in fuzzy control (the famous "washing
machine" example), which is completely different and largely unrelated
to fuzzy knowledge representation. In knowledge representation, fuzzy
approaches are also studied, but their application is usually in a
closed system (e.g., if you have one system that extracts data from a
text and assigns "certainties" to all extracted facts in the same way).
It's still unclear how to choose the right logic, but at least it will
give you a uniform treatment of your data according to some fixed
principles (whether they make sense or not).
The situation is much clearer in probabilistic logics, where you define
your assumptions first (e.g., you assume that events are independent or
that dependencies are captured in some specific way). This makes it more
rigorous, but also harder to apply, since in practice these assumptions
rarely hold. This is somewhat tolerable if you have a rather uniform
data set (e.g., a lot of sensor measurements that give you some
probability for actual states of the underlying system). But if you have
a huge, open, cross-domain system like Wikidata, it would be almost
impossible to force it into a particular probability framework where
"0.3" really means "in 30% of all cases".
Also note that scientific probability is always a limit of observed
frequencies. It says: if you do something again and again, this is the
rate you will get. Often-heard statements like "We have an 80% chance to
succeed!" or "Chances are almost zero that the Earth will blow up
tomorrow!" are scientifically pointless, since you cannot repeat the
experiments that they claim to make statements about. Many things we
have in Wikidata are much more on the level of such general statements
than on the level that you normally use probability for (a good example
of a proper use of probability: "based on the tests that we did so far,
this patient has a 35% chance of having cancer" -- these are not the
things we normally have in Wikidata).
Markus
2014-05-29 13:43 GMT+02:00 Markus Krötzsch
<mar...@semantic-mediawiki.org>:
On 29/05/14 12:41, Thomas Douillard wrote:
@David:
I think you should have a look at fuzzy logic
<https://www.wikidata.org/wiki/Q224821> :)
Or at probabilistic logic, possibilistic logic, epistemic logic, ...
it's endless. Let's first complete the data we are sure of before we
start to discuss whether Pluto is a planet with fuzzy degree 0.6 or
0.7 ;-)
(The problem with quantitative logics is that there is usually no
reference for the numbers you need there, so they are not well
suited for a secondary data collection like Wikidata that relies on
other sources. The closest concept that still might work is
probabilistic logic, since you can really get some probabilities
from published data; but even there it is hard to use the
probability as a raw value without specifying very clearly what the
experiment looked like.)
Markus
_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l