Re: Confused about linking engines

Rupert Westenthaler Tue, 21 May 2013 05:07:42 -0700

On Tue, May 21, 2013 at 12:56 PM, Luigi Selmi <[email protected]> wrote:
> Hello Reto and Rupert,
>
> I was looking at the same components and some more things that should be
> put in a clearer way are the sites (Referenced/Managed) and the Yards
> (Solr/Clerezza) that can be used for linking and interlinking. I write here
> my understanding about the components currently provided by Stanbol for the
> linking and interlinking tasks to be sure it is correct (or not) and also
> some questions.
>
> In Working with Custom Vocabularies
> <http://stanbol.apache.org/docs/trunk/customvocabulary.html> it is said
> that a Referenced or a Managed site can be used for linking. Both must be
> based on a Solr yard so that keyword search is possible. It may seem
> obvious, but it should be stressed that they must be used with RDF datasets
> if one wants to look for entities using keywords. While it is possible to
> add new RDF triples to a Managed site (and get them indexed), the same
> cannot be done with a Referenced site: once it has been built with a proper
> tool, as explained in the page cited above, it cannot be updated in the
> same way. In order to use these sites for linking and interlinking, an
> enhancement engine (EntityhubLinkingEngine or NamedEntityTaggingEngine)
> must be configured in the Felix console, providing the identifier of the
> site to use to search for entities (URIs) to link to. In the configuration
> panels only Referenced Sites are mentioned, but Managed Sites based on a
> Solr Yard should also work (?).
>
> The first type of engine compares tokens in the text that reaches the
> engine, possibly through a chain that has a tokenizer before the linking
> engine, with the values of the rdfs:label property of the target RDF data
> indexed within the site, in order to find entities (subject URIs of the
> rdfs:label property). The results of the comparison are ranked, added to
> the contentitem metadata and finally sent to the client.
>
> The second type of linking engine (NamedEntityTagging) uses the result of a
> NER process. This means that it can be used only in a chain where a NER
> engine comes before it. Currently it can be configured to work only with
> persons, organizations and places, because only models for these types of
> entities are available in Stanbol. The NER engines look for entities of
> those types within the text and are configured to use some well-known URIs
> for the types mentioned, for example http://dbpedia.org/ontology/Person for
> persons. The result of the NER process is put in the contentitem's metadata
> and used by the next engine for interlinking, which will use only the
> rdfs:label property attached to entities of those types (e.g.
> http://dbpedia.org/ontology/Person) for comparison.
>
> A second issue with this architecture, after the one about using Managed
> sites instead of Referenced ones in the linking engines' configuration
> panels, is about doing the interlinking with the RDF data extracted from
> documents and stored in the content graph or in other graphs based on the
> Clerezza Yard. The only way to use these graphs seems to be to make a copy
> of the graph and store the data in a Solr yard to be used in a
> Managed/Referenced site.
>
> While the documentation about Managed and Referenced sites is quite good,
> even if it lacks some details, the same cannot be said about the Entityhub.
> It is not very clear whether it is just an interface to all the sites
> (managed and referenced) or whether there is something more.
>
> To sum up the main points are:
>
> 1) is it possible to use a managed site instead of a referenced one in the
> linking engine configuration panels (both types) ?


Yes, any Site (Managed or Referenced) can be used.

The usage scenario was written before ManagedSites even existed.
Because of that it still says "ReferencedSite" in some places where
it should say "any Site".

> 2) which is the best way to do interlinking with RDF data in a graph within
> Stanbol with the current components ? Only the one I mentioned or there are
> other options ?

You can configure an EntityLinking or NamedEntityLinking engine to use
a Referenced-/ManagedSite that uses a ClerezzaYard, but I would not
recommend it, because lookups would be translated to SPARQL queries.
You would most likely run into performance issues, even with
relatively small datasets. In addition, you might miss expected
results, as SPARQL endpoints lack full-text search features like
tokenization, stemming, char folding ...
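
To illustrate the "missed results" point, here is a toy Python sketch. It
is not how Stanbol, Solr or any SPARQL endpoint is implemented; the
function names, the example URI and the label are made up for the
illustration. It only contrasts a plain string comparison with a lookup
that tokenizes and folds characters first (stemming is left out):

```python
import unicodedata


def fold(text):
    """Lowercase and strip diacritics (a crude form of char folding)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def tokens(text):
    """Fold the text, then split it on every non-alphanumeric character."""
    cleaned = "".join(c if c.isalnum() else " " for c in fold(text))
    return cleaned.split()


def exact_lookup(labels, query):
    """Plain string equality, similar to comparing literals without analysis."""
    return [uri for uri, label in labels.items() if label == query]


def fulltext_lookup(labels, query):
    """Hit when every folded query token occurs among the folded label tokens."""
    wanted = set(tokens(query))
    return [uri for uri, label in labels.items() if wanted <= set(tokens(label))]


# Hypothetical vocabulary: one entity with one label.
labels = {"urn:example:uzh": "Universität Zürich"}

print(exact_lookup(labels, "universitat zurich"))     # finds nothing
print(fulltext_lookup(labels, "universitat zurich"))  # finds the entity
```

A full-text index does this kind of analysis at indexing and query time,
which is what you give up when lookups go against the RDF graph directly.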

With the current components I would recommend, as you proposed, to
create your own Site that uses a SolrYard and copy over your RDF data.
If you want to do that in a batch process you can use the Entityhub
Indexing Tool. If you need to update single Entities as they change
in your RDF data you should go for a ManagedSite.

FYI: The two-layered storage infrastructure (STANBOL-471) was intended
for scenarios like this, but I have not seen much activity from Anil
on it recently. With that, the RDF graph would have the role of the
Store (I even started to implement Clerezza and Jena TDB based Store
implementations, see STANBOL-704) and the Index would be the
ManagedSite with a SolrYard.
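
As a rough picture of the Store/Index split, here is a toy Python sketch.
The class and method names are hypothetical and have nothing to do with
the actual STANBOL-471 code: a dict stands in for the RDF graph (the
Store) and a token map stands in for the SolrYard (the Index). Every
write goes to the Store and is propagated to the Index, and lookups only
touch the Index:

```python
class TwoLayerStorage:
    """Toy model: the Store is the source of truth, the Index serves lookups."""

    def __init__(self):
        self.store = {}  # uri -> set of labels (stand-in for the RDF graph)
        self.index = {}  # lowercased token -> set of uris (stand-in for Solr)

    def put(self, uri, labels):
        """Create or replace an entity and keep the index in sync."""
        self.remove(uri)
        self.store[uri] = set(labels)
        for label in labels:
            for token in label.lower().split():
                self.index.setdefault(token, set()).add(uri)

    def remove(self, uri):
        """Delete an entity from the store and drop its index entries."""
        for label in self.store.pop(uri, ()):
            for token in label.lower().split():
                uris = self.index.get(token)
                if uris is not None:
                    uris.discard(uri)
                    if not uris:
                        del self.index[token]

    def lookup(self, text):
        """Return the uris whose labels contain all tokens of the query."""
        hits = [self.index.get(t, set()) for t in text.lower().split()]
        return sorted(set.intersection(*hits)) if hits else []


storage = TwoLayerStorage()
storage.put("urn:example:1", ["Peter Mustermann"])
storage.put("urn:example:1", ["Paul Mustermann"])  # entity changed in the graph
print(storage.lookup("mustermann"))
```

The point is only that changing a single entity updates both layers,
which is the case where a ManagedSite fits better than a batch re-index.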

> 3) can anyone provide some details about the entityhub (not managed or
> referenced sites) ?

You can look at the Entityhub as the default managed site. It has some
special functionality, like support for importing Entities from other
Sites, but AFAIK that never really got adopted. Note, however, that
Entities stored in the Entityhub cannot be retrieved by using the
SiteManager's RESTful services ('entityhub/site' endpoint).


best
Rupert

>
> Best Regards
>
> Luigi
>
>
> 2013/5/20 Rupert Westenthaler <[email protected]>
>
>> On Mon, May 20, 2013 at 3:07 PM, Reto Bachmann-Gmür <[email protected]>
>> wrote:
>> > Thanks Rupert for these clarification.
>> >
>> > One thing that still isn't clear. You say that the EntityLinking engines
>> > operate on a single token, while named entity tagging works on phrases.
>> > What does this mean? I see that EntityLinking detects multi-word entities.
>> > What are the cases EntityLinking cannot handle?
>>
>> Yes EntityLinking tries to match several tokens with labels of
>> entities within the controlled vocabulary, but it still considers
>> single tokens as a potential "match".
>>
>> In contrast, NamedEntityLinking would not allow a link for "Peter" if
>> "Peter Mustermann" was recognized as a named entity. Also, the entity
>> "Peter Mustermann jun." would only be suggested for the mention "Peter
>> Mustermann" in that case, even if the text actually mentions "Peter
>> Mustermann jun.".
>>
>> best
>> Rupert
>>
>> >
>> > Cheers,
>> > Reto
>> >
>> >
>> > On Mon, May 20, 2013 at 2:05 PM, Rupert Westenthaler <
>> > [email protected]> wrote:
>> >
>> >> On Mon, May 20, 2013 at 12:34 PM, Reto Bachmann-Gmür <[email protected]>
>> >> wrote:
>> >> > Named Entity Tagging Engine: This creates entity references exclusively
>> >> > for substrings identified to denote a person, people or place by the
>> >> > named entity recognizer.
>> >>
>> >> Correct. This Engine can use type restrictions based on the types
>> >> detected by NER when linking against the Vocabularies. In addition it
>> >> also searches for Entities matching the "phrase" detected as Named
>> >> Entities. The EntityLinking engine operates on single Tokens.
>> >>
>> >> >
>> >> > Entityhub Linking Engine: This creates the entity references using the
>> >> > results of NLP processing. Only some lexical categories are processed,
>> >> > these are determined by the parameter in "Processed Languages" as well
>> >> > as with the "Link ProperNouns only".
>> >> >
>> >>
>> >> The Entityhub Linking Engine is a configuration of the
>> >> EntityLinkingEngine that uses the Entityhub to search for Entities in
>> >> the controlled vocabulary. It does not implement any linking
>> >> functionality itself.
>> >>
>> >>
>> >> > Keyword Linking Engine: "An engine that extracts keywords present
>> >> > within a Controlled Vocabulary mentioned within parsed ContentItem". I
>> >> > assumed this would just link any matching word sequences without
>> >> > requiring any NLP (except word tokenization). However the config pane
>> >> > says that the parameter "Min Token length" is ignored in case a POS
>> >> > (Part of Speech) tagger is available for the language of the parsed
>> >> > content. So is this using NLP as well?
>> >> >
>> >>
>> >> This engine is deprecated. It's the predecessor of the Entity Linking
>> >> Engine.
>> >>
>> >>
>> >> > So these are the 3 Engines I find in the configuration. Then there's
>> >> > also the EntityLinkingEngine according to
>> >> > https://stanbol.apache.org/docs/trunk/components/enhancer/engines/entitylinking
>> >> >
>> >>
>> >> This implements the Entity Linking process. To use it one needs to
>> >> provide implementations of the extension points (EntitySearcher and
>> >> LabelTokenizer).
>> >>
>> >> > Confusingly https://stanbol.apache.org/docs/trunk/customvocabulary.html
>> >> > distinguishes between Named Entity Linking, for which it refers to the
>> >> > Named Entity Tagging Engine, and Keyword Linking, for which it doesn't
>> >> > refer to the "Keyword Linking Engine" but to "Entityhub linking engine"
>> >> > (the document has some issues: STANBOL-1075).
>> >>
>> >> "Keyword Linking" should no longer be used. "Named Entity Linking" and
>> >> "Entity Linking" are the preferred terms.
>> >>
>> >> You are right. "Working with Custom Vocabularies" does have some
>> >> inconsistencies in the last part. "2. Keyword Linking" should be "2.
>> >> Entity Linking", and the 2nd heading "Configuring Named Entity
>> >> Linking" should read "Configuring Entity Linking" instead.
>> >>
>> >> best
>> >> Rupert
>> >>
>> >>
>> >> --
>> >> | Rupert Westenthaler             [email protected]
>> >> | Bodenlehenstraße 11                             ++43-699-11108907
>> >> | A-5500 Bischofshofen
>> >>
>>
>>
>>



