Hi,

Although I cannot answer for the Stanbol enhancer specifically, I can report my experience. Please see inline.

On Wed, Mar 14, 2012 at 9:56 PM, Mathieu D'Aquin <[email protected]> wrote:

> Sure, wikipedia is a lot more populated with american things than others.
> What is unclear to me however, is how the enhancer gets to choose "Sean
> Connery" as the universal representative of all the Seans in the world and
> by extension how I can recognise when it is wrong.
>

Presumably because there are more links in Wikipedia to Sean_Connery than to the page of any other Sean.

> I understand that, directly or indirectly, the enhancer would favour
> common entities. I'm just unsure how it is evaluated that an entity is more
> common than another.
>

The number of links to an entity's page in Wikipedia is commonly used as a prior probability estimate for that entity.

> Has there been any evaluation of the results of the enhancer that could
> show this bias?
>

State-of-the-art entity linkers include such a prior as one of the components in their disambiguation algorithms. The trick is to find the right bias. There is some preliminary analysis by Fader et al. [1], and you will see the feature appearing also at TAC-KBP-2011.

We have proposed to perform evaluations of different enhancement chains within the EAP. Analyzing the impact of this bias would certainly be one of the evaluation points.

>
> Thanks,
> Mathieu.
>
> On 14 Mar 2012, at 20:00, Pablo Mendes wrote:
>
> > I can confirm that from my experience with DBpedia Spotlight, the bias
> > seems to come from Wikipedia itself.
> >
> > As a simple exercise, not intended to convince more than to entertain:
> > 230,447 results for organization [1]
> > 75,414 results for organisation [2]
> >
> > Cheers,
> > Pablo
> >
> > [1] http://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=~organization&fulltext=Search
> > [2] http://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=~organisation&fulltext=Search
> >
> > On Wed, Mar 14, 2012 at 7:39 PM, [email protected] <[email protected]> wrote:
> >
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA1
> >>
> >> If you are using DBPedia as a source of enhancement possibilities, I wonder if that has to do more with a bias in the DBpedia dataset than any bias in Stanbol?
> >>
> >> - ---
> >> A. Soroka
> >> Software & Systems Engineering :: Online Library Environment
> >> the University of Virginia Library
> >>
> >> On Mar 14, 2012, at 1:20 PM, Mathieu D'Aquin wrote:
> >>
> >>> Hi Rupert,
> >>>
> >>> Thanks for the quick answer and the pointer.
> >>> In summary, if I understand well, it is the enhancer's normal behaviour to return such entities (e.g., that everybody called Sean will be recognised as Sean Connery) and the only thing for me to do is to apply some post processing/filtering.
> >>>
> >>> Would there be some documentation explaining more comprehensively what kind of filters should be applied for different types of entities? I noticed for example that the enhancer is biased towards american presidents and american universities. Actually, generally, it is quite biased towards american things.
> >>>
> >>> Thanks!
> >>> Mathieu.
> >>>
> >>> On 14 Mar 2012, at 12:00, Rupert Westenthaler wrote:
> >>>
> >>>> Hi
> >>>> On 14.03.2012, at 12:25, Mathieu D'Aquin wrote:
> >>>>
> >>>>> Hi All,
> >>>>>
> >>>>> I'm trying to use the enhancer service, currently with the default settings, but it seems to be behaving rather funnily.
> >>>>> (note that I only care about EntityAnnotations with references to dbpedia entities).
> >>>>>
> >>>>> For example, I have tried with the text of the page
> >>>>> http://sssw.org/2012/invited-speakers-tutors/
> >>>>>
> >>>>> And it gives very weird (even random looking) results, such as "Sean Connery" or "Nazi Germany".
> >>>>>
> >>>> If you find "Germany" as a location Stanbol will return three suggested entities. In this case this will be
> >>>>
> >>>> 1. http://dbpedia.org/resource/Germany (confidence: 1704736.125)
> >>>> 2. http://dbpedia.org/resource/Nazi_Germany (confidence: 121766.984)
> >>>> 3. http://dbpedia.org/resource/West_Germany (confidence: 38052.215)
> >>>>
> >>>> (confidence values for the NamedEntityTaggingEngine are the Solr scores for the used query)
> >>>>
> >>>> I guess this is the reason why you are getting Nazi_Germany as a suggestion for a lot of pages.
> >>>>
> >>>> For Persons the problem is with cases where OpenNLP NER (Named Entity Recognition) marks a Person in the text, but only provides the given or family name (e.g. "sean"). In this case the Entity linking will provide you with the most prominent person in DBpedia with that name - in your case "Sean Connery".
> >>>>
> >>>> This problem is also described by [STANBOL-320](https://issues.apache.org/jira/browse/STANBOL-320).
> >>>>
> >>>>> This weird behaviour is not limited to this page. I have processed several thousand pages and clearly the results have not been what we would have expected (very often, for example, it gives us the entity "Jesus" for no obvious reason).
> >>>>>
> >>>>
> >>>> Jesus is also a "Person" in DBpedia. So I assume that this is similar to "sean" -> "Sean Connery"
> >>>>
> >>>>> Am I doing something wrong?
> >>>>> Do the default enhancer services need some kind of configuration?
> >>>>>
> >>>>
> >>>> related to this I would suggest to
> >>>>
> >>>> * only consider the suggestion with the highest confidence
> >>>> * ignore TextAnnotations with "dc:type=dbp-ont:Person" if the "fise:selected-text" property only has a given or family name
> >>>>
> >>>>
> >>>> best
> >>>> Rupert
> >>>>
> >>>>> I have looked at the documentation but couldn't find anything that seemed to be helpful with this respect.
> >>>>>
> >>>>> Thanks!
> >>>>> Mathieu.
> >>>>>
> >>>>> --
> >>>>> The Open University is incorporated by Royal Charter (RC 000391), an exempt charity in England & Wales and a charity registered in Scotland (SC 038302).
> >>>>
> >>>
> >>
> >> -----BEGIN PGP SIGNATURE-----
> >> Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
> >> Comment: GPGTools - http://gpgtools.org
> >>
> >> iQEcBAEBAgAGBQJPYOX0AAoJEATpPYSyaoIkckEIAMr+BIkDTgram4Ow7NeEOSxj
> >> K+vSWHStUfaOXnWSj8v6unwDls/yS6H+CZn20rezeLkJZ7nckOc+9TQIcwhbl0yV
> >> LxYsx7NIfiefPKwCGyDH1n8Y4080CspXgWKO5+38pTT5+EjHtU4ienLhDIRjETY7
> >> +cTh2mQN4fe8VoYgpgl1YQgpafCMmZHwP36ftA3likEO2ZGdOJmPzTpEGR/2A2FQ
> >> kYVZshoX6Y6sjSnD+gCfxwPPliE9Td8tJGxKECmAKn8/JRRaDSsQ9AckN3E3hGEg
> >> 1guc4HHkIRmJcu7wTbJR6gHmXm5zLWtdMHqLxf6z7KYRb3TkwA22erO+WD8PWs0=
> >> =aYov
> >> -----END PGP SIGNATURE-----
> >>
> >
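
For what it is worth, the two post-processing steps Rupert suggests in the quoted thread could be sketched roughly as below. This is only a minimal illustration in Python: the dictionary layout, the function names and the "dbp-ont:Place" type are my own assumptions and not the format Stanbol actually returns, and "a single token" is just a crude stand-in for "only a given or family name". The Germany confidences are the ones quoted in the thread; the score for "Sean" is made up.

```python
def is_single_name(selected_text):
    """Crude heuristic (an assumption): one token means only a given or family name."""
    return len(selected_text.split()) < 2


def filter_annotations(text_annotations):
    """Return (selected_text, best_entity_uri) pairs after applying both filters."""
    kept = []
    for ann in text_annotations:
        # Filter 1: skip Person mentions that consist of a single name, e.g. "Sean".
        if ann["type"] == "dbp-ont:Person" and is_single_name(ann["selected_text"]):
            continue
        if not ann["suggestions"]:
            continue
        # Filter 2: keep only the suggestion with the highest confidence.
        best = max(ann["suggestions"], key=lambda s: s["confidence"])
        kept.append((ann["selected_text"], best["entity"]))
    return kept


# Hypothetical input layout, not Stanbol's real output format.
annotations = [
    {
        "selected_text": "Germany",
        "type": "dbp-ont:Place",  # assumed type label for a location
        "suggestions": [
            {"entity": "http://dbpedia.org/resource/Germany", "confidence": 1704736.125},
            {"entity": "http://dbpedia.org/resource/Nazi_Germany", "confidence": 121766.984},
            {"entity": "http://dbpedia.org/resource/West_Germany", "confidence": 38052.215},
        ],
    },
    {
        "selected_text": "Sean",
        "type": "dbp-ont:Person",
        "suggestions": [
            # Made-up score, for illustration only.
            {"entity": "http://dbpedia.org/resource/Sean_Connery", "confidence": 1.0},
        ],
    },
]

print(filter_annotations(annotations))
```

With this input only the Germany mention survives, linked to its top-scoring candidate. A real filter would probably need something smarter than token counting to decide what counts as a full name.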

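To make the remark about link counts as a prior a bit more concrete, here is a small Python sketch under stated assumptions. The inlink counts and context scores are invented for illustration (they are not real Wikipedia numbers), the function names are hypothetical, and the linear blend is just one simple way to combine a prior with contextual evidence, not how Stanbol or DBpedia Spotlight actually score candidates.

```python
def link_prior(inlink_counts):
    """Estimate P(entity) as the entity's share of inlinks among the candidates."""
    total = sum(inlink_counts.values())
    if total == 0:
        return {entity: 0.0 for entity in inlink_counts}
    return {entity: count / total for entity, count in inlink_counts.items()}


def rank(inlink_counts, context_scores, prior_weight):
    """Rank candidates by prior_weight * prior + (1 - prior_weight) * context score."""
    prior = link_prior(inlink_counts)
    scored = [
        (entity, prior_weight * prior[entity]
         + (1.0 - prior_weight) * context_scores.get(entity, 0.0))
        for entity in inlink_counts
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented numbers for candidates matching the surface form "Sean".
inlinks = {"Sean_Connery": 600, "Sean_Penn": 300, "Sean_Bean": 100}
# Invented context similarity, e.g. from comparing the page text with each entity.
context = {"Sean_Connery": 0.1, "Sean_Penn": 0.2, "Sean_Bean": 0.9}

print(rank(inlinks, context, prior_weight=1.0)[0][0])  # prior only
print(rank(inlinks, context, prior_weight=0.2)[0][0])  # context dominates
```

With the prior alone, the most linked candidate always wins, which is the "every Sean is Sean Connery" effect discussed in this thread. Lowering the weight lets the context overrule it, and choosing that weight is the "finding the right bias" part.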