Hi Betsey,

One thing to take into account is that currently the "confidence" value
passed as a parameter affects both the spotter and the disambiguator. So
if you set the threshold too low, the spotter will pick up a lot of
irrelevant surface forms, which are then passed on to the disambiguator,
which will also apply a low filter threshold.
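
For reference, here is a rough sketch of how you could compare two
confidence values against the REST annotate endpoint. The endpoint URL
below is only an example; point it at whichever server you are actually
querying.

  # Sketch: compare annotations at two confidence thresholds.
  # The endpoint URL is only an example; use the server you normally query.
  import requests

  ENDPOINT = "http://spotlight.sztaki.hu:2222/rest/annotate"
  TEXT = ("President Obama called Wednesday on Congress to extend a tax "
          "break for students included in last year's economic stimulus "
          "package.")

  def annotate(text, confidence):
      resp = requests.post(
          ENDPOINT,
          data={"text": text, "confidence": confidence},
          headers={"Accept": "application/json"},
      )
      resp.raise_for_status()
      # "Resources" is missing from the JSON when nothing was annotated
      return resp.json().get("Resources", [])

  for conf in (0.3, 0.5):
      print(conf, [r["@surfaceForm"] for r in annotate(TEXT, conf)])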

There are a few other reasons and issues (check GitHub) why you might not
be getting interesting entities at high confidence values. As an example
(using your sample text), you might not be getting the entity "Barack
Obama" from the surface form "Obama" due to the discount mechanism used
when generating the models: since "Obama" is a sub-sequence of other
surface forms such as "Barack Obama", it may have been discounted enough
to end up with a very low probability.
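
Just to illustrate the idea with made-up numbers (the real counts come
out of the model-generation step, not from this arithmetic):

  # Toy illustration with invented counts -- not the real model numbers.
  # Suppose "Obama" occurs 1000 times, but 950 of those occurrences are
  # inside the longer surface form "Barack Obama" and get discounted.
  total = 1000
  inside_longer_form = 950

  # What is left for the standalone surface form "Obama"
  standalone_prob = (total - inside_longer_form) / total
  print(standalone_prob)  # 0.05 -- low enough to be dropped at higher confidence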

I encourage you to give Spotlight 0.6 (the statistical version) a try.
Depending on your use case you might get less noise, but you might need
more memory/processing power. Models and jar are available here [1].

I've got no idea about querying DBpedia or how DBpedia is structured, but
given a DBpedia ID you can match it to a Freebase ID and ask Freebase
whether the entity is of type '/people/person' or '/time/event'. For the
types you mention you should have no problems, since Freebase has very
good coverage of people, events and locations.
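
Since your results already carry Freebase types in the "@types" field, a
simple post-filter on the annotate output might be enough. A rough sketch
(the whitelist below is just an example, extend it to the Freebase types
you care about):

  # Sketch: keep only annotations whose Freebase types match a whitelist.
  WANTED = {"Freebase:/people/person", "Freebase:/time/event",
            "Freebase:/location/location"}

  def is_interesting(resource):
      types = set(resource.get("@types", "").split(","))
      return bool(types & WANTED)

  # Example resource, shaped like the JSON output of /rest/annotate:
  resource = {"@surfaceForm": "Obama",
              "@URI": "http://dbpedia.org/resource/Barack_Obama",
              "@types": "DBpedia:Person,Schema:Person,Freebase:/people/person"}
  print(is_interesting(resource))  # True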

[1] http://spotlight.sztaki.hu/downloads/version-0.1/





On Fri, Aug 15, 2014 at 2:29 PM, Betsey Benagh <
[email protected]> wrote:

> Thanks to everyone for the help yesterday with the statistical endpoint.
>
> I'm trying to understand how to tune the tool to get optimal results.
>
> When I used the example text in the demo interface -
>
> President Obama called Wednesday on Congress to extend a tax break
>   for students included in last year's economic stimulus package, arguing
>   that the policy provides more generous assistance.
>
> A confidence of 0.5 only picked up 'Congress'.  Reducing the confidence to
> 0.3 picked up a lot more stuff - including linking 'Wednesday' to a sports
> team, which seems bizarre to me.
>
> On my own data, which comes from Twitter, I see weird things like mentions
> of 'police' linking to the musical group The Police, and the word
> 'celebrate' (in the context of celebrating an anniversary) linking to the
> Madonna song.  If I turn the confidence up, I lose those references, but I
> also lose 'good' references as well.
>
> I feel like whitelisting or blacklisting is the way to go, but I'm having
> trouble correlating the types I see in my results with the ontology at
> http://mappings.dbpedia.org/server/ontology/classes/.  That ontology
> particularly confuses me, as it seems very uneven - as an example, under
> 'Organization', there are classes that make sense to me, like 'Company' and
> 'Sports League', and then oddly specific things like 'Comedy Group' and
> 'Samba School' at the same level.  In my results, there is a mix of types
> from DBpedia, Schema, and Freebase, and it's not clear to me how I would
> specify (for example) that I'm interested in people, places, and events,
> but not musical groups, internet concepts (it always picks up 'http' from
> embedded links and gives me 'Hypertext Transfer Protocol'), etc.
>
> Thanks!
>
> Betsey Benagh
>
> Boston Fusion Corp.
> 1 Van de Graaff Drive, Ste 107
> Burlington, MA 01803-5176
> [email protected]
> 617-583-5730 x106 (office)
> 781-367-6720 (mobile)