I can confirm that from my experience with DBpedia Spotlight, the bias seems to come from Wikipedia itself.
As a simple exercise, meant to entertain more than to convince:

230,447 results for organization [1]
75,414 results for organisation [2]

Cheers,
Pablo

[1] http://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=~organization&fulltext=Search
[2] http://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=~organisation&fulltext=Search

On Wed, Mar 14, 2012 at 7:39 PM, [email protected] <[email protected]> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> If you are using DBpedia as a source of enhancement possibilities, I
> wonder if that has to do more with a bias in the DBpedia dataset than
> any bias in Stanbol?
>
> - ---
> A. Soroka
> Software & Systems Engineering :: Online Library Environment
> the University of Virginia Library
>
> On Mar 14, 2012, at 1:20 PM, Mathieu D'Aquin wrote:
>
> > Hi Rupert,
> >
> > Thanks for the quick answer and the pointer.
> > In summary, if I understand well, it is the enhancer's normal
> > behaviour to return such entities (e.g., that everybody called Sean
> > will be recognised as Sean Connery) and the only thing for me to do
> > is to apply some post processing/filtering.
> >
> > Would there be some documentation explaining more comprehensively
> > what kind of filters should be applied for different types of
> > entities? I noticed for example that the enhancer is biased towards
> > American presidents and American universities. Actually, generally,
> > it is quite biased towards American things.
> >
> > Thanks!
> > Mathieu.
> >
> > On 14 Mar 2012, at 12:00, Rupert Westenthaler wrote:
> >
> >> Hi
> >> On 14.03.2012, at 12:25, Mathieu D'Aquin wrote:
> >>
> >>> Hi All,
> >>>
> >>> I'm trying to use the enhancer service, currently with the default
> >>> settings, but it seems to be behaving rather funnily.
> >>> (note that I only care about EntityAnnotations with references to
> >>> dbpedia entities).
> >>>
> >>> For example, I have tried with the text of the page
> >>> http://sssw.org/2012/invited-speakers-tutors/
> >>>
> >>> And it gives very weird (even random looking) results, such as
> >>> "Sean Connery" or "Nazi Germany".
> >>>
> >> If you find "Germany" as a location Stanbol will return three
> >> suggested entities. In this case this will be
> >>
> >> 1. http://dbpedia.org/resource/Germany (confidence: 1704736.125)
> >> 2. http://dbpedia.org/resource/Nazi_Germany (confidence: 121766.984)
> >> 3. http://dbpedia.org/resource/West_Germany (confidence: 38052.215)
> >>
> >> (confidence values for the NamedEntityTaggingEngine are the Solr
> >> scores for the used query)
> >>
> >> I guess this is the reason why you are getting Nazi_Germany as a
> >> suggestion for a lot of pages.
> >>
> >> For Persons the problem is with cases where OpenNLP NER (Named
> >> Entity Recognition) marks a Person in the text, but only provides
> >> the given or family name (e.g. "sean"). In this case the Entity
> >> linking will provide you with the most prominent person in DBpedia
> >> with that name - in your case "Sean Connery".
> >>
> >> This problem is also described by
> >> [STANBOL-320](https://issues.apache.org/jira/browse/STANBOL-320).
> >>
> >>> This weird behaviour is not limited to this page. I have processed
> >>> several thousand pages and clearly the results have not been what
> >>> we would have expected (very often, for example, it gives us the
> >>> entity "Jesus" for no obvious reason).
> >>>
> >>
> >> Jesus is also a "Person" in DBpedia. So I assume that this is
> >> similar to "sean" -> "Sean Connery"
> >>
> >>> Am I doing something wrong?
> >>> Do the default enhancer services need some kind of configuration?
> >>>
> >>
> >> related to this I would suggest to
> >>
> >> * only consider the suggestion with the highest confidence
> >> * ignore TextAnnotations with "dc:type=dbp-ont:Person" if the
> >>   "fise:selected-text" property only has a given or family name
> >>
> >>
> >> best
> >> Rupert
> >>
> >>> I have looked at the documentation but couldn't find anything that
> >>> seemed to be helpful in this respect.
> >>>
> >>> Thanks!
> >>> Mathieu.
> >>>
> >>> --
> >>> The Open University is incorporated by Royal Charter (RC 000391),
> >>> an exempt charity in England & Wales and a charity registered in
> >>> Scotland (SC 038302).
> >>
> >
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
> Comment: GPGTools - http://gpgtools.org
>
> iQEcBAEBAgAGBQJPYOX0AAoJEATpPYSyaoIkckEIAMr+BIkDTgram4Ow7NeEOSxj
> K+vSWHStUfaOXnWSj8v6unwDls/yS6H+CZn20rezeLkJZ7nckOc+9TQIcwhbl0yV
> LxYsx7NIfiefPKwCGyDH1n8Y4080CspXgWKO5+38pTT5+EjHtU4ienLhDIRjETY7
> +cTh2mQN4fe8VoYgpgl1YQgpafCMmZHwP36ftA3likEO2ZGdOJmPzTpEGR/2A2FQ
> kYVZshoX6Y6sjSnD+gCfxwPPliE9Td8tJGxKECmAKn8/JRRaDSsQ9AckN3E3hGEg
> 1guc4HHkIRmJcu7wTbJR6gHmXm5zLWtdMHqLxf6z7KYRb3TkwA22erO+WD8PWs0=
> =aYov
> -----END PGP SIGNATURE-----
>

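For what it's worth, Rupert's two suggestions quoted above could be sketched as a small post-processing filter. This is only an illustration under assumptions: the TextAnnotation and Suggestion classes below are hypothetical in-memory stand-ins (Stanbol actually returns the enhancements as RDF), the "dbp-ont:Place" type and the score for "Sean" are placeholders, and "only a given or family name" is approximated crudely as "selected text is a single token".

```python
from dataclasses import dataclass

PERSON_TYPE = "dbp-ont:Person"


@dataclass
class Suggestion:
    # Hypothetical stand-in for an EntityAnnotation: entity URI plus score.
    uri: str
    confidence: float


@dataclass
class TextAnnotation:
    # Hypothetical stand-in for a TextAnnotation with its suggested entities.
    selected_text: str   # corresponds to fise:selected-text
    dc_type: str         # corresponds to dc:type
    suggestions: list


def filter_annotations(annotations):
    """Keep one (selected text, entity URI) pair per usable annotation."""
    kept = []
    for ann in annotations:
        if not ann.suggestions:
            continue
        # Rule 2: skip Persons mentioned by a single name only ("sean").
        # Assumption: a single whitespace-separated token means that only
        # a given or family name was found.
        if ann.dc_type == PERSON_TYPE and len(ann.selected_text.split()) < 2:
            continue
        # Rule 1: only consider the suggestion with the highest confidence.
        best = max(ann.suggestions, key=lambda s: s.confidence)
        kept.append((ann.selected_text, best.uri))
    return kept


annotations = [
    # Scores for "Germany" are the ones quoted in the thread; the type
    # label is an assumption.
    TextAnnotation("Germany", "dbp-ont:Place", [
        Suggestion("http://dbpedia.org/resource/Germany", 1704736.125),
        Suggestion("http://dbpedia.org/resource/Nazi_Germany", 121766.984),
        Suggestion("http://dbpedia.org/resource/West_Germany", 38052.215),
    ]),
    # Placeholder score, not taken from the thread.
    TextAnnotation("Sean", PERSON_TYPE, [
        Suggestion("http://dbpedia.org/resource/Sean_Connery", 1.0),
    ]),
]

results = filter_annotations(annotations)
print(results)
```

With this input only the Germany annotation survives, linked to its top-scoring entity. The single-token test will also drop people who really are known by one name, so it trades recall for precision.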