Re: Simple question about the enhancer service

Rupert Westenthaler Wed, 14 Mar 2012 22:15:42 -0700

Hello

In short: the number of incoming links is used to boost documents (entities) in 
the Solr index.


For everyone who is interested, here are the details about the ranking of 
DBpedia entities:

(1) As Pablo correctly assumed, we use the number of incoming links to rank 
entities [1]:

curl $DBPEDIA/en/page_links_en.nt.bz2 \
        | bzcat \
        | sed -e 's/.*<http\:\/\/dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
        | sort -S $MAX_SORT_MEM \
        | uniq -c  \
        | sort -nr -S $MAX_SORT_MEM > $WORKSPACE/indexing/resources/incoming_links.txt

(2) These numbers are normalized using the natural logarithm: 
Math.log1p(incomingCount), i.e. ln(1 + incomingCount)
(3) These values are then normalized to the range [0..1]
(4) The resulting value is used as the Solr document boost when creating the 
Solr index
(5) Note also the field boosts used by default for DBpedia [2]

Steps 2 - 4 are configured by the "scoreNormalizer" property of the 
indexing.properties [3]:
   * MinScoreNormalizer is executed first and can be used to control the number 
of indexed entities (see minincoming.properties)
   * NaturalLogNormaliser see (2)
   * RangeNormaliser see (3)
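
To make steps (2) and (3) concrete, here is a minimal Python sketch of the 
normalization chain. It is not the Stanbol code (which is Java) and only 
illustrates the arithmetic. The function name normalize_scores, the 
min_incoming parameter and the link counts are invented for the example. 
Two things are assumptions: that the minimum-score step simply drops 
entities below a threshold, and that the range step divides by the largest 
value.

```python
import math


def normalize_scores(incoming_links, min_incoming=2):
    """Map raw incoming-link counts to boosts in the range [0, 1].

    Hypothetical helper for illustration; not part of Stanbol.
    """
    # Assumed behaviour of the minimum-score step: drop rarely linked entities.
    kept = {e: c for e, c in incoming_links.items() if c >= min_incoming}
    if not kept:
        return {}

    # Step (2): dampen the counts with ln(1 + count), like Java's Math.log1p.
    logged = {e: math.log1p(c) for e, c in kept.items()}

    # Step (3): scale so that the largest value becomes 1.0 (assumed scheme).
    top = max(logged.values())
    if top == 0:
        return {e: 0.0 for e in logged}
    return {e: v / top for e, v in logged.items()}


# Made-up link counts, only to show the shape of the result.
counts = {
    "Germany": 100000,
    "Nazi_Germany": 10000,
    "West_Germany": 3000,
    "Obscure_Page": 1,
}
boosts = normalize_scores(counts)
for entity, boost in sorted(boosts.items(), key=lambda item: -item[1]):
    print(f"{entity}\t{boost:.3f}")
```

The logarithm compresses the gap between very popular and moderately 
popular entities, so that the boost does not completely swamp the text 
match score of the query.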

best
Rupert

[1] http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/dbpedia/fetch_prepare.sh
[2] http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/dbpedia/src/main/resources/indexing/config/fieldboosts.properties
[3] http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/dbpedia/src/main/resources/indexing/config/indexing.properties

On 14.03.2012, at 21:56, Mathieu D'Aquin wrote:

> Sure, Wikipedia is a lot more populated with American things than others. 
> What is unclear to me, however, is how the enhancer gets to choose "Sean 
> Connery" as the universal representative of all the Seans in the world and, by 
> extension, how I can recognise when it is wrong.
> 
> I understand that, directly or indirectly, the enhancer would favour common 
> entities. I'm just unsure how it is evaluated that an entity is more common 
> than another.
> 
> Has there been any evaluation of the results of the enhancer that could show 
> this bias? 
> 
> Thanks, 
> Mathieu.
> 
> On 14 Mar 2012, at 20:00, Pablo Mendes wrote:
> 
>> I can confirm that from my experience with DBpedia Spotlight, the bias
>> seems to come from Wikipedia itself.
>> 
>> As a simple exercise, not intended to convince more than to entertain:
>> 230,447 results for organization [1]
>> 75,414 results for organisation [2]
>> 
>> Cheers,
>> Pablo
>> [1]
>> http://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=~organization&fulltext=Search
>> [2]
>> http://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=~organisation&fulltext=Search
>> 
>> 
>> On Wed, Mar 14, 2012 at 7:39 PM, [email protected] <[email protected]> wrote:
>> 
>>> If you are using DBpedia as a source of enhancement possibilities, I
>>> wonder if that has more to do with a bias in the DBpedia dataset than any
>>> bias in Stanbol?
>>> 
>>> ---
>>> A. Soroka
>>> Software & Systems Engineering :: Online Library Environment
>>> the University of Virginia Library
>>> 
>>> On Mar 14, 2012, at 1:20 PM, Mathieu D'Aquin wrote:
>>> 
>>>> Hi Rupert,
>>>> 
>>>> Thanks for the quick answer and the pointer.
>>>> In summary, if I understand correctly, it is the enhancer's normal behaviour
>>>> to return such entities (e.g., that everybody called Sean will be
>>>> recognised as Sean Connery) and the only thing for me to do is to apply
>>>> some post-processing/filtering.
>>>> 
>>>> Would there be some documentation explaining more comprehensively what
>>>> kind of filters should be applied for different types of entities? I
>>>> noticed, for example, that the enhancer is biased towards American presidents
>>>> and American universities. Actually, it is generally quite biased towards
>>>> American things.
>>>> 
>>>> Thanks!
>>>> Mathieu.
>>>> 
>>>> On 14 Mar 2012, at 12:00, Rupert Westenthaler wrote:
>>>> 
>>>>> Hi
>>>>> On 14.03.2012, at 12:25, Mathieu D'Aquin wrote:
>>>>> 
>>>>>> Hi All,
>>>>>> 
>>>>>> I'm trying to use the enhancer service, currently with the default
>>>>>> settings, but it seems to be behaving rather funnily.
>>>>>> (Note that I only care about EntityAnnotations with references to
>>>>>> DBpedia entities.)
>>>>>> 
>>>>>> For example, I have tried with the text of the page
>>>>>> http://sssw.org/2012/invited-speakers-tutors/
>>>>>> 
>>>>>> And it gives very weird (even random-looking) results, such as "Sean
>>>>>> Connery" or "Nazi Germany".
>>>>>> 
>>>>> If "Germany" is found as a location, Stanbol will return three suggested
>>>>> entities. In this case these are:
>>>>> 
>>>>> 1. http://dbpedia.org/resource/Germany (confidence: 1704736.125)
>>>>> 2. http://dbpedia.org/resource/Nazi_Germany (confidence: 121766.984)
>>>>> 3. http://dbpedia.org/resource/West_Germany (confidence: 38052.215)
>>>>> 
>>>>> (confidence values for the NamedEntityTaggingEngine are the Solr scores
>>>>> for the query used)
>>>>> 
>>>>> I guess this is the reason why you are getting Nazi_Germany as a
>>>>> suggestion for a lot of pages.
>>>>> 
>>>>> For persons, the problem is with cases where OpenNLP NER (Named Entity
>>>>> Recognition) marks a person in the text but only provides the given or
>>>>> family name (e.g. "sean"). In this case the entity linking will provide you
>>>>> with the most prominent person in DBpedia with that name - in your case
>>>>> "Sean Connery".
>>>>> 
>>>>> This problem is also described by [STANBOL-320](https://issues.apache.org/jira/browse/STANBOL-320).
>>>>> 
>>>>>> This weird behaviour is not limited to this page. I have processed
>>>>>> several thousand pages and clearly the results have not been what we would
>>>>>> have expected (very often, for example, it gives us the entity "Jesus" for
>>>>>> no obvious reason).
>>>>>> 
>>>>> 
>>>>> Jesus is also a "Person" in DBpedia. So I assume that this is similar
>>>>> to "sean" -> "Sean Connery".
>>>>> 
>>>>>> Am I doing something wrong?
>>>>>> Do the default enhancer services need some kind of configuration?
>>>>>> 
>>>>> 
>>>>> Related to this, I would suggest that you
>>>>> 
>>>>> * only consider the suggestion with the highest confidence
>>>>> * ignore TextAnnotations with "dc:type=dbp-ont:Person" if the
>>>>>   "fise:selected-text" property only has a given or family name
>>>>> 
>>>>> 
>>>>> best
>>>>> Rupert
>>>>> 
>>>>>> I have looked at the documentation but couldn't find anything that
>>>>>> seemed to be helpful in this respect.
>>>>>> 
>>>>>> Thanks!
>>>>>> Mathieu.
>>>>>> 
>>>>>> --
>>>>>> The Open University is incorporated by Royal Charter (RC 000391), an
>>>>>> exempt charity in England & Wales and a charity registered in Scotland (SC
>>>>>> 038302).
>>>>> 
>>>> 
>>> 
>
