Hi Betsey,

One thing to keep in mind is that the "confidence" value passed as a parameter currently affects both the spotter and the disambiguator. So when you set the threshold too low, you spot a lot of irrelevant things, which then get passed to a disambiguator that is also running with a low filter threshold.
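As a rough illustration of working around that coupling, you could request annotations at a moderate confidence and then drop unwanted types on the client side. This is only a sketch under assumptions: the endpoint URL and the helper names build_annotate_url and keep_allowed_types are made up for illustration, and it assumes the JSON response lists each resource with a comma-separated '@types' field. Please check that against what your server actually returns.

```python
from urllib.parse import urlencode

# Assumed endpoint (hypothetical); substitute the server you actually use.
ANNOTATE_URL = "http://localhost:2222/rest/annotate"


def build_annotate_url(text, confidence, base_url=ANNOTATE_URL):
    """Build a GET URL for an annotate call with the given confidence."""
    query = urlencode({"text": text, "confidence": confidence})
    return f"{base_url}?{query}"


def keep_allowed_types(resources, allowed_types):
    """Keep only resources whose '@types' field mentions an allowed type.

    `resources` is assumed to be a list of dicts, each carrying a
    comma-separated '@types' string such as 'DBpedia:Person,Schema:Person'.
    Matching is case-insensitive; resources without types are dropped.
    """
    allowed = {t.lower() for t in allowed_types}
    kept = []
    for resource in resources:
        raw = resource.get("@types", "")
        types = {t.strip().lower() for t in raw.split(",")}
        if types & allowed:
            kept.append(resource)
    return kept
```

With a whitelist such as ["DBpedia:Person", "DBpedia:Place", "DBpedia:Event"], a record typed only as a band or a song would be filtered out even when the low confidence lets it through the spotter.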
There are a few other reasons and open issues (check GitHub) why you might not be getting interesting entities at high confidence values. For example, using your sample text, you might not be getting the entity "Barack Obama" from the surface form "Obama" because of the discount mechanism used when generating the models: since "Obama" is a sub-sequence of other surface forms such as "Barack Obama", it may have been discounted enough to end up with a very low probability.

I encourage you to try Spotlight 0.6 (the statistical version). Depending on your use case you might get less noise, but you might need more memory and processing power. Models and jar are available here [1].

I don't know much about querying DBpedia or how DBpedia is structured, but given a dbpedia_id you can match it to a freebase_id and ask Freebase whether the entity has the type '/people/person' or '/time/event'. For the types you mention you should have no problems, since Freebase has very good coverage of people, events and locations.

[1] http://spotlight.sztaki.hu/downloads/version-0.1/

On Fri, Aug 15, 2014 at 2:29 PM, Betsey Benagh <[email protected]> wrote:

> Thanks to everyone for the help yesterday with the statistical endpoint.
>
> I'm trying to understand how to tune the tool to get optimal results.
>
> When I used the example text in the demo interface -
>
> President Obama called Wednesday on Congress to extend a tax break
> for students included in last year's economic stimulus package, arguing
> that the policy provides more generous assistance.
>
> A confidence of 0.5 only picked up 'Congress'. Reducing the confidence to
> 0.3 picked up a lot more stuff - including linking 'Wednesday' to a sports
> team, which seems bizarre to me.
>
> On my own data, which comes from Twitter, I see weird things like mentions
> of 'police' linking to the musical group The Police, and the word
> 'celebrate' (in the context of celebrating an anniversary) linking to the
> Madonna song. If I turn the confidence up, I lose those references, but I
> also lose 'good' references as well.
>
> I feel like whitelisting or blacklisting is the way to go, but I'm having
> trouble correlating the types I see in my results with the ontology at
> http://mappings.dbpedia.org/server/ontology/classes/ That ontology
> particularly confuses me, as it seems very uneven - as an example, under
> 'Organization', there are classes that make sense to me, like 'Company' and
> 'Sports League', and then oddly specific things like 'Comedy Group' and
> 'Samba School' at the same level. In my results, there is a mix of types
> from DBpedia, Schema, and Freebase, and it's not clear to me how I would
> specify (for example) that I'm interested in people, places, and events,
> but not musical groups, internet concepts (it always picks up 'http' from
> embedded links and gives me 'Hypertext Transfer Protocol'), etc.
>
> Thanks!
>
> Betsey Benagh
>
> Boston Fusion Corp.
> 1 Van de Graaff Drive, Ste 107
> Burlington, MA 01803-5176
> [email protected]
> 617-583-5730 x106 (office)
> 781-367-6720 (mobile)
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Dbp-spotlight-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users
