Sorry, just to clarify: the demo endpoint you are using runs Spotlight 0.7. Spotlight 0.7 uses statistical models, but you can also use statistical models with Spotlight 0.6. For my use case, the statistical models with Spotlight 0.6 give me much less noise and usually very relevant results. So my suggestion is to also try Spotlight 0.6 with statistical models.
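On the blacklist question quoted below: independent of what the 'types' parameter accepts, one option is to post-filter the results on the client. The following is only a minimal sketch, assuming the annotations arrive as a list of JSON objects shaped like your example result, with "@types" as a comma-separated string. The function name `drop_blocked`, the blacklist constant, and the second sample entry are hypothetical and only there for illustration; this is not part of the Spotlight API.

```python
# Hypothetical blacklist: the two type prefixes asked about in the quoted message.
BLOCKED_TYPE_PREFIXES = ("Freebase:/internet", "Freebase:/computer")


def drop_blocked(resources, blocked_prefixes=BLOCKED_TYPE_PREFIXES):
    """Keep only annotations that have no type starting with a blocked prefix."""
    kept = []
    for resource in resources:
        # "@types" is assumed to be a comma-separated string; it may be empty or absent.
        raw_types = resource.get("@types", "")
        types = [t.strip() for t in raw_types.split(",") if t.strip()]
        # str.startswith accepts a tuple, so one call checks every blocked prefix.
        if not any(t.startswith(blocked_prefixes) for t in types):
            kept.append(resource)
    return kept


sample = [
    {   # Fields copied from the example result in the quoted message.
        "@URI": "http://dbpedia.org/resource/Hypertext_Transfer_Protocol",
        "@surfaceForm": "http",
        "@types": "Freebase:/internet/protocol,Freebase:/internet,"
                  "Freebase:/computer/internet_protocol,Freebase:/computer,"
                  "Freebase:/internet/api",
    },
    {   # Made-up entry for illustration only, not real Spotlight output.
        "@URI": "http://dbpedia.org/resource/Barack_Obama",
        "@surfaceForm": "Obama",
        "@types": "Freebase:/people/person",
    },
]

filtered = drop_blocked(sample)
print([r["@surfaceForm"] for r in filtered])
```

Matching on prefixes means "Freebase:/internet" also covers subtypes such as "Freebase:/internet/protocol". A whitelist would work the same way with the condition inverted, although annotations that carry no types at all would then need an explicit decision.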
On Fri, Aug 15, 2014 at 3:23 PM, Betsey Benagh <[email protected]> wrote:

> I found that the statistical endpoint gave me even more noise, actually.
>
> Your comments about types confused me more - my question about that is
> really this: what do I put into the 'types' parameter?
>
> An example result I don't want to see:
> {"@URI":"http://dbpedia.org/resource/Hypertext_Transfer_Protocol","@offset":"153","@percentageOfSecondRank":"-1.0","@similarityScore":"0.05355825275182724","@support":"248","@surfaceForm":"http","@types":"Freebase:/internet/protocol,Freebase:/internet,Freebase:/computer/internet_protocol,Freebase:/computer,Freebase:/internet/api"}
> surfaceForm: http
> URI: http://dbpedia.org/resource/Hypertext_Transfer_Protocol
>
> Would I blacklist "Freebase:/internet" and "Freebase:/computer"? Those
> don't appear in the ontology pointed to by the documentation.
>
> Betsey Benagh
>
> Boston Fusion Corp.
> 1 Van de Graaff Drive, Ste 107
> Burlington, MA 01803-5176
> [email protected]
> 617-583-5730 x106 (office)
> 781-367-6720 (mobile)
>
> On Fri, Aug 15, 2014 at 10:15 AM, David Przybilla <[email protected]> wrote:
>
>> Hi Betsey,
>>
>> One thing to take into account is that currently the "confidence" value
>> passed as a parameter affects both the spotter and the disambiguation. So
>> whenever you set the threshold too low, you will spot a lot of irrelevant
>> things, which then get passed to the disambiguator, which will also have a
>> low filter threshold.
>>
>> There are a few other reasons and issues (check github) why you might not
>> be getting interesting entities with high confidence values. As an example
>> (using your sample text), you might not be getting the entity "Barack
>> Obama" from the surfaceForm 'Obama' due to the discount mechanism used
>> when generating the models: since "Obama" is a sub-sequence of other
>> surface forms such as "Barack Obama", it might have been discounted enough
>> to have a very low probability.
>>
>> I encourage you to give Spotlight 0.6 (statistical version) a try.
>> Depending on your use case you might get less noise, but you might need
>> more memory/processing power.
>> Models and jar are available here [1]
>>
>> I've got no idea about querying dbpedia or how dbpedia is structured, but
>> given a dbpedia_id you can match it to a freebase_id and ask freebase if
>> the current entity is of type '/people/person' or '/time/event'.
>> With the types you mention you should have no problems, since freebase
>> has very good coverage of people, events and locations.
>>
>> [1] http://spotlight.sztaki.hu/downloads/version-0.1/
>>
>> On Fri, Aug 15, 2014 at 2:29 PM, Betsey Benagh <[email protected]> wrote:
>>
>>> Thanks to everyone for the help yesterday with the statistical endpoint.
>>>
>>> I'm trying to understand how to tune the tool to get optimal results.
>>>
>>> When I used the example text in the demo interface -
>>>
>>> President Obama called Wednesday on Congress to extend a tax break
>>> for students included in last year's economic stimulus package, arguing
>>> that the policy provides more generous assistance.
>>>
>>> A confidence of 0.5 only picked up 'Congress'. Reducing the confidence
>>> to 0.3 picked up a lot more stuff - including linking 'Wednesday' to a
>>> sports team, which seems bizarre to me.
>>>
>>> On my own data, which comes from Twitter, I see weird things like
>>> mentions of 'police' linking to the musical group The Police, and the word
>>> 'celebrate' (in the context of celebrating an anniversary) linking to the
>>> Madonna song. If I turn the confidence up, I lose those references, but I
>>> also lose 'good' references as well.
>>>
>>> I feel like whitelisting or blacklisting is the way to go, but I'm
>>> having trouble correlating the types I see in my results with the ontology
>>> at http://mappings.dbpedia.org/server/ontology/classes/ That ontology
>>> particularly confuses me, as it seems very uneven - as an example, under
>>> 'Organization', there are classes that make sense to me, like 'Company' and
>>> 'Sports League', and then oddly specific things like 'Comedy Group' and
>>> 'Samba School' at the same level. In my results, there are a mix of types
>>> from DBpedia, Schema, and Freebase, and it's not clear to me how I would
>>> specify (for example) that I'm interested in people, places, and events,
>>> but not musical groups, internet concepts (it always picks up 'http' from
>>> embedded links and gives me 'Hypertext Transfer Protocol'), etc.
>>>
>>> Thanks!
>>>
>>> Betsey Benagh
>>>
>>> Boston Fusion Corp.
>>> 1 Van de Graaff Drive, Ste 107
>>> Burlington, MA 01803-5176
>>> [email protected]
>>> 617-583-5730 x106 (office)
>>> 781-367-6720 (mobile)
>>>
>>> ------------------------------------------------------------------------------
>>> _______________________________________________
>>> Dbp-spotlight-users mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users
