Hi David,
By confidence I mean that an entity with score 1 is always, or almost always,
correct, one with score .5 is correct half of the time, and so on. It indicates
how sure you are that the entity is really in the text, and if your application
needs near-certainty, you set the threshold close to 1.
I would like to have that value because I am curious to see how the quality of
the profiles I am generating changes as it varies. In this case the profiles
describe the users who produced the text.
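The experiment described here, varying the threshold and watching how the
result changes, could be sketched as follows. This is only an illustration: the
`sample` list of (score, correct) pairs is hypothetical hand-labelled data, and
`precision_at_threshold` is a helper name made up for this sketch, not part of
Spotlight.

```python
from typing import List, Optional, Tuple


def precision_at_threshold(
    scored: List[Tuple[float, bool]], threshold: float
) -> Tuple[int, Optional[float]]:
    """Return (number of entities kept, fraction of kept entities that are correct).

    An entity is kept when its score is >= threshold. Precision is None
    when nothing survives the threshold.
    """
    kept = [correct for score, correct in scored if score >= threshold]
    if not kept:
        return 0, None
    return len(kept), sum(kept) / len(kept)


# Hypothetical hand-labelled sample: (confidence score, entity was correct?)
sample = [
    (0.95, True),
    (0.90, True),
    (0.60, True),
    (0.55, False),
    (0.30, False),
    (0.20, True),
]

for threshold in (0.0, 0.5, 0.9):
    print(threshold, precision_at_threshold(sample, threshold))
```

If the score behaved like the confidence described above, precision among the
kept entities would rise toward 1 as the threshold approaches 1.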
Regards,
Stefano
From: David Przybilla <[email protected]>
Date: Friday 20 June 2014 16:41
To: Stefano Bocconi <[email protected]>
Cc: Radim Rehurek <[email protected]>,
"[email protected]" <[email protected]>
Subject: Re: [Dbp-spotlight-users] How is the confidence value calculated?
Hi Stefano,
Yes, in the calculations I refer to the second candidate.
I'm a bit confused: what do you mean by "confidence"?
There is a Spotlight branch we have been working on that adds Relevance Scores.
The Relevance Score reflects the importance of an entity in the text; I wonder
whether this is of interest for what you are doing.
David
On Fri, Jun 20, 2014 at 3:18 PM, Stefano Bocconi <[email protected]> wrote:
Hi Radim,
I have not evaluated my formula. I thought it was a more or less logical
combination of how sure Spotlight was of the first candidate and whether the
second candidate was at the same level.
In David’s explanation I think that the second rank is calculated not with the
bottom entity but with the second best. So if I am very sure of the first
candidate, but also of the second, I get a 50% chance that either is true.
Now it turns out that this is not “confidence” but “disambiguation”, so
Spotlight could be very sure of having disambiguated to the wrong entity, and
it seems there is no concept of confidence in Spotlight at the moment.
As I wrote in another mail, the candidates REST function returns some more info
for each candidate, so there are more parameters to study; unfortunately I
doubt I will have the time to do so now.
In any case I have to manually evaluate something like 500 tweets, and I might
reuse this corpus in the future to correlate it with the other parameters.
Regards,
Stefano
From: Radim Rehurek <[email protected]>
Date: Tuesday 17 June 2014 11:37
To: David Przybilla <[email protected]>
Cc: Stefano Bocconi <[email protected]>,
"[email protected]" <[email protected]>
Subject: Re: [Dbp-spotlight-users] How is the confidence value calculated?
Thanks David.
If I understand your reply correctly, you're advocating using "similarityScore"
directly as Spotlight's detection confidence.
I wonder if this is better than Stefano's formula. Stefano, did you evaluate
your formula somehow? Mixing support into the confidence formula makes good
sense to me too.
Best,
Radim
---------- Original message ----------
From: David Przybilla <[email protected]>
To: Radim Rehurek <[email protected]>
Date: 17. 6. 2014 11:06:01
Subject: Re: [Dbp-spotlight-users] How is the confidence value calculated?
Hi Radim, Stefano,
1. This is roughly how I think it works; it is best to confirm by checking the
code or the paper:
The support you give via the endpoint serves as a filter on how many annotated
counts an entity must have.
The confidence value you give via the endpoint is used twice:
- To filter spots (chunks of surface forms which will later be matched to a
topic)
- To filter topic annotations once you have disambiguated (the secondRank
filter is also used in this stage)
Similarity_of_t = ln(surfaceFormPrior) + ln(prior_of_t)
                  + contextSimilarity_for_t
softTotalSimilarity = sum(e ^ Similarity_of_i)
final_similarity_of_t = e ^ (Similarity_of_t - softTotalSimilarity)
-- order the topics by similarity (greatest first)
secondRank = e ^ (bottomTopicFinalSimilarityScore
                  - topTopicFinalSimilarityScore)
topics with secondRank > (1 - confidence ^ 2) are filtered
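The steps above can be sketched in Python as follows. This is a reading of the
pseudocode, not Spotlight's actual implementation, and it makes two assumptions
that the thread does not confirm: that the soft total is meant as the logarithm
of the sum (so that the final similarity is a softmax over the candidates), and
that secondRank compares the log-space scores of the top and second-best
candidate (so that it equals the ratio of their final similarities). All
function names are hypothetical.

```python
import math
from typing import Dict


def similarity(surface_form_prior: float, prior: float, context_similarity: float) -> float:
    """Log-space score of one candidate topic, as in the first line of the pseudocode."""
    return math.log(surface_form_prior) + math.log(prior) + context_similarity


def final_similarities(log_scores: Dict[str, float]) -> Dict[str, float]:
    """Softmax over the log-space scores (ASSUMPTION: soft total = log of the sum)."""
    top = max(log_scores.values())
    # Subtract the maximum before exponentiating for numerical stability.
    log_total = top + math.log(sum(math.exp(s - top) for s in log_scores.values()))
    return {topic: math.exp(s - log_total) for topic, s in log_scores.items()}


def second_rank(log_scores: Dict[str, float]) -> float:
    """e ^ (second-best score - best score); 0.0 when there is a single candidate.

    ASSUMPTION: computed on log-space scores, giving a ratio in (0, 1].
    """
    ranked = sorted(log_scores.values(), reverse=True)
    if len(ranked) < 2:
        return 0.0
    return math.exp(ranked[1] - ranked[0])


def is_filtered(second_rank_value: float, confidence: float) -> bool:
    """The filter as stated in the pseudocode: secondRank > (1 - confidence ^ 2)."""
    return second_rank_value > 1 - confidence ** 2


# Hypothetical candidates for one surface form, with log-space scores.
candidates = {"Topic_A": 0.0, "Topic_B": math.log(0.25)}
print(final_similarities(candidates))
print(second_rank(candidates))
print(is_filtered(second_rank(candidates), confidence=0.5))
```

Under these assumptions, two equally scored candidates each get a final
similarity of 0.5 and a secondRank of 1, so the annotation is filtered for any
confidence above 0.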
2. What is the best value?
I think this really depends on your use case. For example, if you need lots of
general topics you might want a low value, but be prepared for a wave of dodgy
topic and surface-form annotations as well.
If you are working with social media, you most likely have lots of surface
forms, and variations of them, which are not getting spotted because of the
confidence value.
My advice is to adjust the confidence and support values empirically and then
tweak the Spotlight model to adapt it to your particular use case [1].
[1] https://github.com/idio/spotlight-model-editor
On Mon, Jun 16, 2014 at 5:32 PM, Radim Rehurek <[email protected]> wrote:
I would also be extremely interested in an answer to this. Thanks for asking,
Stefano.
What's the best way to calculate "Spotlight's detection confidence" as a single
number?
Cheers,
Radim
---------- Original message ----------
From: Stefano Bocconi <[email protected]>
To: [email protected] <[email protected]>
Date: 16. 6. 2014 18:14:26
Subject: [Dbp-spotlight-users] How is the confidence value calculated?
Hi,
I am new to this list; I came here from the Spotlight GitHub page about support
and feedback. Questions related to mine have come up a couple of times on this
list as far as I can see, but the answers do not provide what I am looking for.
I am using the statistical back-end, and I am basically trying to reconstruct
the confidence value of the extracted entities.
I have extracted entities from tweets, and as a first experiment I did not ask
for any confidence threshold. Now I would like to calculate the confidence of
each result to see how filtering on it influences the quality of another
process I run on the entities.
I am now using the formula:
(1 - .5 * percentageofsecondrank) * similarityscore
It is based on the idea that confidence increases with the similarity score but
decreases if the second candidate is also similar.
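As a small worked illustration of this formula (a sketch only; the function
name `stefano_confidence` is made up here, and the two inputs are assumed to be
the similarityscore and percentageofsecondrank values mentioned above):

```python
def stefano_confidence(similarity_score: float, percentage_of_second_rank: float) -> float:
    """(1 - .5 * percentageofsecondrank) * similarityscore, as written in the mail."""
    return (1 - 0.5 * percentage_of_second_rank) * similarity_score


# No competing second candidate: the similarity score is kept unchanged.
print(stefano_confidence(0.8, 0.0))
# Second candidate as good as the first: the score is halved.
print(stefano_confidence(1.0, 1.0))
```

The penalty factor ranges from 1 (no comparable second candidate) down to 0.5
(second candidate equally similar), so the result never exceeds the similarity
score itself.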
Is this comparable to what Spotlight uses in
http://spotlight.dbpedia.org/rest/annotate? Or else what is the formula? Does
support play a role?
Thanks,
Stefano
------------------------------------------------------------------------------
HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions
Find What Matters Most in Your Big Data with HPCC Systems
Open Source. Fast. Scalable. Simple. Ideal for Dirty Data.
Leverages Graph Analysis for Fast Processing & Easy Data Exploration
http://p.sf.net/sfu/hpccsystems
_______________________________________________
Dbp-spotlight-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users