Hi all,

Since you use Spotlight as a black box and you are working on social media,
I wonder if [1] and [2] could help you work something out, as some sort of
more sophisticated "priors".

@jodeiber :) I was a bit confused about the confidence value affecting these
two things. It is also worth mentioning that the default value for surface
forms is 0.5 (I think) and for topics is 0.1.
I feel tempted to update the wiki, but I find it a bit odd that GitHub
publishes wiki changes without any proper review process.


[1] http://basekb.com/subjectiveEye/wikipedia_traffic_page_counts.php
[2] https://github.com/paulhoule/telepath/wiki/SubjectiveEye3D


On Tue, Jun 17, 2014 at 2:13 PM, Radim Rehurek <[email protected]>
wrote:

> Hello Jo,
>
>
> The statistical model selects the best annotations by maximizing the
> probability that the entity could have generated a particular surface form
> and the context. The final model score for each entity is of the form
> P(surface form, context, entity) = P(sf | e) P(context | e) P(e).
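>
> As an illustration only (the names and numbers below are hypothetical, not
> the model's actual internals), the score can be computed in log space
> roughly like this:
>
> import math
>
> # Toy probability values; the real model derives these from Wikipedia counts.
> def log_score(p_sf_given_e, p_context_given_e, p_e):
>     return math.log(p_sf_given_e) + math.log(p_context_given_e) + math.log(p_e)
>
> # Two hypothetical candidate entities for the same surface form "apple":
> candidates = {
>     "Apple_Inc.": log_score(0.7, 0.02, 0.001),
>     "Apple_(fruit)": log_score(0.3, 0.001, 0.002),
> }
> best = max(candidates, key=candidates.get)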
>
>
> Thanks for the clarification. Is it possible to add this probability (or its
> logarithm) to the REST output as well, next to `support`, `similarityScore`,
> etc.? I'd like to experiment with it.
>
>
> I'll try to do a pull request, but it will take me time. I'm hoping
> someone already well-versed in the current codebase can add it much more
> easily/quickly.
>
>
> Best,
>
> Radim
>
>
> This probability will be very small. At the moment, the similarity score
> is calculated using the softmax (as David explained), which basically means
> the similarity score is roughly the ratio of the probability of the current
> entity to that of all candidate entities together. This does not tell you
> much about the real confidence, however; it only tells you how "sure" a
> disambiguation is (if there is only one candidate for a surface form, this
> value will always be 1.0, yet there could still be an error in the
> candidate mapping).
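>
> A toy sketch of that softmax normalisation (the scores are made up; note
> that with a single candidate the result is always 1.0):
>
> import math
>
> def softmax_scores(log_scores):
>     # Normalise candidate log-scores into a distribution over the candidates.
>     log_total = math.log(sum(math.exp(s) for s in log_scores.values()))
>     return {e: math.exp(s - log_total) for e, s in log_scores.items()}
>
> softmax_scores({"Apple_Inc.": -4.2, "Apple_(fruit)": -7.9})  # ratio among candidates
> softmax_scores({"Berlin": -6.0})                             # single candidate -> 1.0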
>
> Passing the confidence=x.x parameter will filter both surface forms and
> entities. These are really two different thresholds; I would prefer to have
> them separately, but did not separate them because I didn't want to touch
> the web interface. The REST module is a real mess at the moment and could
> easily be rewritten in a much simpler way.
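>
> For example (a sketch using the public annotate endpoint mentioned later in
> this thread; the threshold values are arbitrary):
>
> import requests
>
> resp = requests.get(
>     "http://spotlight.dbpedia.org/rest/annotate",
>     params={"text": "Berlin is the capital of Germany.",
>             "confidence": 0.35,  # filters both spots and disambiguated entities
>             "support": 20},      # minimum prominence (annotated count) of an entity
>     headers={"Accept": "application/json"},
> )
> annotations = resp.json().get("Resources", [])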
>
>
> Hope that helps,
> Jo
>
>
>
> On Tue, Jun 17, 2014 at 11:37 AM, Radim Rehurek <[email protected]>
> wrote:
>
> Thanks David.
>
> If I understand your reply correctly, you're advocating using
> "similarityScore" directly as Spotlight's detection confidence.
>
> I wonder if this is better than Stefano's formula. Stefano, did you
> evaluate your formula somehow? Mixing support into the confidence formula
> makes good sense to me too.
>
> Best,
> Radim
>
> ---------- Original message ----------
> From: David Przybilla <[email protected]>
> To: Radim Rehurek <[email protected]>
> Date: 17. 6. 2014 11:06:01
> Subject: Re: [Dbp-spotlight-users] How is the confidence value calculated?
>
> Hi Radim, Stefano,
>
> 1. This is roughly how I think it works; it's best to confirm by checking
> the code/paper:
>
> The support you give via the endpoint serves as a filter on how many
> annotated counts an entity should have.
>
> The confidence value you give via the endpoint is used twice:
>
>  - to filter spots (chunks of surface forms which will later be matched to
> a topic);
>  - to filter topic annotations once you have disambiguated (the secondRank
> filter is also used in this stage).
>
>
> Similarity_of_t = ln(surfaceFormPrior_of_t) + ln(prior_of_t) + contextSimilarity_of_t
> logTotalSimilarity = ln( sum_i( e ^ Similarity_of_i ) )
> finalSimilarity_of_t = e ^ (Similarity_of_t - logTotalSimilarity)
>
> -- order the topics by final similarity (greatest first)
> secondRank = finalSimilarity_of_lowerRankedTopic / finalSimilarity_of_topTopic
>            = e ^ (Similarity_of_lowerRankedTopic - Similarity_of_topTopic)
>
> topics with secondRank > (1 - confidence^2) are filtered out
>
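> A small Python sketch of that pipeline (the variable names and scores are
> hypothetical; this only mirrors the formulas above, not the actual Spotlight
> code):
>
> import math
>
> def rank_and_filter(similarities, confidence):
>     # similarities: {topic: log-space Similarity_of_t}; returns surviving topics.
>     log_total = math.log(sum(math.exp(s) for s in similarities.values()))
>     final = {t: math.exp(s - log_total) for t, s in similarities.items()}
>     ranked = sorted(final.items(), key=lambda kv: kv[1], reverse=True)
>     top_topic, top_score = ranked[0]
>     kept = [top_topic]
>     for topic, score in ranked[1:]:
>         second_rank = score / top_score  # == e^(Similarity_of_topic - Similarity_of_top)
>         if second_rank <= (1 - confidence ** 2):
>             kept.append(topic)
>     return kept
>
> rank_and_filter({"Apple_Inc.": -4.2, "Apple_(fruit)": -4.5, "Apple_Records": -9.0}, 0.5)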
>
> 2. What is the best value?
>
> I think this really depends on your use case. For example, if you need
> lots of general topics you might want a low value, but be prepared for a
> wave of dodgy topic and surface-form annotations as well.
>
> If you are doing social media, you most likely have lots of surface forms,
> and variations of them, which are not getting spotted because of the
> confidence value.
> My advice is to empirically adjust the confidence and support values and
> then tweak the Spotlight model to adapt it to your particular use case [1].
>
> [1] https://github.com/idio/spotlight-model-editor
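>
> One way to do that adjustment empirically (a sketch against the public
> endpoint; the tiny labelled sample and the threshold grid are made up for
> illustration):
>
> import requests
>
> labelled = [("Obama visited Berlin", {"Barack_Obama", "Berlin"})]
>
> def annotate(text, confidence):
>     r = requests.get("http://spotlight.dbpedia.org/rest/annotate",
>                      params={"text": text, "confidence": confidence},
>                      headers={"Accept": "application/json"})
>     return {res["@URI"].rsplit("/", 1)[-1] for res in r.json().get("Resources", [])}
>
> for conf in (0.1, 0.3, 0.5, 0.7):
>     hits = sum(len(annotate(text, conf) & gold) for text, gold in labelled)
>     print(conf, hits)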
>
>
>
> On Mon, Jun 16, 2014 at 5:32 PM, Radim Rehurek <[email protected]>
> wrote:
>
> I would also be extremely interested in an answer to this. Thanks for
> asking, Stefano.
>
> What's the best way to calculate "Spotlight's detection confidence" as a
> single number?
>
> Cheers,
> Radim
>
> ---------- Original message ----------
> From: Stefano Bocconi <[email protected]>
> To: [email protected] <[email protected]>
> Date: 16. 6. 2014 18:14:26
> Subject: [Dbp-spotlight-users] How is the confidence value calculated?
>
> Hi,
>
>  I am new to this list; I came here from the GitHub Spotlight page about
> support and feedback. Questions related to what I am asking have popped up
> a couple of times on this list as far as I can see, but the answers do not
> provide what I am looking for.
>
>  I am using the statistical back-end, and I am basically trying to
> reconstruct the confidence value of the entities extracted.
>
>  I have extracted entities from tweets, and as a first experiment I did
> not ask for any confidence threshold. Now I would like to calculate the
> confidence of each result to see how filtering based on it influences the
> quality of some other processing I am doing with the entities.
>
>  I am now using the formula:
>
>  (1 - 0.5 * percentageOfSecondRank) * similarityScore
>
>  This is based on the idea that confidence increases with the similarity
> score, but decreases if the second candidate is also similar.
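>
>  In code, using the `@similarityScore` and `@percentageOfSecondRank` fields
> of each entry in the JSON response (the formula itself is my heuristic
> above, not something Spotlight returns):
>
>  def heuristic_confidence(resource):
>      # resource: one entry of the "Resources" list in the annotate JSON output.
>      similarity = float(resource["@similarityScore"])
>      second_rank = float(resource["@percentageOfSecondRank"])
>      return (1 - 0.5 * second_rank) * similarity
>
>  heuristic_confidence({"@similarityScore": "0.93", "@percentageOfSecondRank": "0.12"})  # ~0.874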
>
>  Is this comparable to what Spotlight uses in
> http://spotlight.dbpedia.org/rest/annotate? Or else what is the formula?
> Does support play a role?
>
>  Thanks,
>
>  Stefano
>
>