On 01/07/14 22:43, Bene* wrote:
On 01.07.2014 22:23, Markus Krötzsch wrote:
P.S. One weakness of my algorithm you can already see: it has trouble
estimating the relevance of very rare properties, such as "Minor
Planet Center observatory code" above. A single wrong annotation may
then lead to wrong suggestions. Also, it seems from my list under (2)
that some Grade I listed buildings are ships. This seems to be an
error that is amplified by the fact that property "masts" is used only
11 times in the dataset I evaluated (last week's data). I guess the
new property suggester rather errs on the other side, being tricked
into suggesting very frequent properties even in places that don't
need them.
However, it is obviously better if the algorithm performs well for
frequently used properties. Isn't it possible to combine the two
systems so that they improve each other? One could check how often a
property is used and then rely on either Markus' or the students'
algorithm accordingly.

My hope is that with my other suggestion (using P31 values as features to correlate with), the property suggester will already be able to outperform my little toy algorithm anyway. One could also combine the two (my algorithm is really simple [1]), but maybe this is not needed.

Cheers

Markus

[1] For each class C and property P, I count:

* #C: the number of items in class C
* #P: the number of items using property P
* #PC: the number of items in class C using the property P
* #items: the total number of items

Then I compute two rates:

* rateCP = #PC / #C (fraction of items in a class with the property)
* rateP = #P / #items (fraction of all items with the property)

I then rank the properties for each class by the ratio rateCP/rateP (intuitively: by what factor does the probability of P increase for items in C?). Moreover, I apply two sigmoid functions [2] to the rates as additional factors, so that properties become less "relevant" if their rates are very high or very low. I don't care about things that almost everything or almost nothing has. Obviously, one can tweak this if one wants to include properties that "almost everything" has anyway.

[2] https://www.google.com/search?sclient=psy-ab&q=1+%2F+%281+%2B+exp%286+*+%28-2+*+x+%2B+0.5%29%29%29&btnG=
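
The counting and ranking described in [1] could be sketched roughly as follows. This is only an illustration, not the code that was actually run: the input format, the names rank_properties and example_items, and the toy data are made up for the example. The sigmoid is the curve from the link in [2]. Which rate each of the two sigmoid factors is applied to is not spelled out above, so the sketch assumes one factor damps a low rateCP and the other damps a high rateP.

```python
import math
from collections import Counter


def sigmoid(x):
    # Curve from footnote [2]: 1 / (1 + exp(6 * (-2 * x + 0.5)))
    return 1.0 / (1.0 + math.exp(6.0 * (-2.0 * x + 0.5)))


def rank_properties(items):
    """Rank properties per class.

    items: dict mapping item id -> (set of classes, set of properties).
    This input format is hypothetical, chosen for the sketch.
    """
    n_items = len(items)   # #items
    count_c = Counter()    # #C
    count_p = Counter()    # #P
    count_pc = Counter()   # #PC, keyed by (class, property)
    for classes, props in items.values():
        count_c.update(classes)
        count_p.update(props)
        for c in classes:
            for p in props:
                count_pc[(c, p)] += 1

    ranking = {}
    for (c, p), n_pc in count_pc.items():
        rate_cp = n_pc / count_c[c]      # fraction of the class with P
        rate_p = count_p[p] / n_items    # fraction of all items with P
        # Assumption: damp properties that almost nothing in the class has
        # (low rate_cp) and properties that almost everything has (high rate_p).
        score = (rate_cp / rate_p) * sigmoid(rate_cp) * sigmoid(1.0 - rate_p)
        ranking.setdefault(c, []).append((p, score))

    for c in ranking:
        ranking[c].sort(key=lambda pair: pair[1], reverse=True)
    return ranking


# Toy data, invented for illustration only.
example_items = {
    "ship1": ({"ship"}, {"masts", "image"}),
    "ship2": ({"ship"}, {"masts", "image"}),
    "bldg1": ({"building"}, {"image", "architect"}),
    "bldg2": ({"building"}, {"image"}),
}
ranking = rank_properties(example_items)
for cls, props in sorted(ranking.items()):
    print(cls, [name for name, _ in props])
```

In this toy data every item has "image", so its rateP is 1 and the second factor pushes it to the bottom of both lists, while "masts" and "architect" rank first for their classes. With so few items the rare-property weakness mentioned in the P.S. applies in full: a single wrong annotation would change the result.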

_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
