Hi Alfredo

On 22.03.2012, at 12:24, seralf wrote:

> Hi, I'm new to Stanbol. I'm reading the documentation and examples, and I'd
> like to start some testing with it on the Italian language, if that's possible.
> 
> Could someone give me some hints regarding the steps to construct my
> model (Italian) and configure it inside the platform? I suppose it's
> possible and that it should not be very far from the steps taken to
> construct, let's say, the Spanish integration.
> What do I need to do? I know it could sound like a very generic question,
> but it's not so clear from the documentation, so I need help.
> For my test I would like to be able to use a text corpus from the database
> of a client, and a SKOS thesaurus from the same domain.
> 
> Thanks in advance for any help (suggestions, code examples, ideas, etc.)
> 

In principle there are two different workflows for extracting entities from text:

(1) NamedEntityExtraction (NER) [3] => NamedEntityLinking [4]
(2) KeywordLinking [5]


(1) requires an OpenNLP [1] NER model for the language of your documents. 
However, OpenNLP currently does not distribute any models for Italian, so you 
would need to build your own. For more information on how to do that, please 
see the OpenNLP documentation [1]. Once you have such models, you only need to 
copy them into the {stanbol-workingdir}/sling/datafiles folder. If they follow 
the naming scheme used by OpenNLP ("{lang}-ner-{type}.bin", e.g. 
"it-ner-location.bin" for the model that detects locations in Italian text), 
Stanbol will pick them up automatically. 
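
To make the copy step for (1) concrete, here is a minimal Python sketch that 
checks model file names against the "{lang}-ner-{type}.bin" scheme before 
copying them into the datafiles folder. It is only an illustration: the 
function names are hypothetical, and the assumption that {lang} is a 
two-letter lowercase code and {type} a lowercase word is mine, not something 
Stanbol or OpenNLP specifies.

```python
import re
import shutil
from pathlib import Path

# Naming scheme from the mail: "{lang}-ner-{type}.bin".
# Assumption: {lang} is two lowercase letters, {type} is lowercase letters.
MODEL_NAME = re.compile(r"^(?P<lang>[a-z]{2})-ner-(?P<type>[a-z]+)\.bin$")


def parse_model_name(filename):
    """Return (lang, type) if the name follows the scheme, else None."""
    match = MODEL_NAME.match(filename)
    return (match.group("lang"), match.group("type")) if match else None


def install_models(model_dir, datafiles_dir):
    """Copy correctly named models into datafiles_dir; return the copied names."""
    target = Path(datafiles_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(model_dir).glob("*.bin")):
        if parse_model_name(path.name) is None:
            # Files with other names are left alone rather than renamed.
            print(f"skipping {path.name}: does not match {{lang}}-ner-{{type}}.bin")
            continue
        shutil.copy2(path, target / path.name)
        copied.append(path.name)
    return copied
```

You would call install_models() with the folder holding your trained models 
and the path of {stanbol-workingdir}/sling/datafiles on your machine.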

(2) directly matches words of the text with labels of entities within the 
controlled vocabulary. This process can be improved by Natural Language 
Processing (e.g. Part-of-Speech tagging), but this is not a requirement. 
Typically this works fine for datasets that contain named entities, such as 
concepts of a thesaurus, contacts of a company, projects, products … It does 
not work well with datasets that contain entities whose labels are also 
used as common words in the given language, as this will result in a lot of 
false positives. 
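
The false-positive problem in (2) is easy to see with a toy example. The 
Python sketch below is not how the KeywordLinkingEngine is implemented; it is 
a deliberately naive whole-word label matcher, and the vocabulary, labels and 
URIs in it are made up for illustration.

```python
import re


def link_labels(text, vocabulary):
    """Return (label, entity_id) pairs whose label occurs as whole words in text.

    Matching is case-insensitive; longer labels are checked first.
    """
    found = []
    by_length = sorted(vocabulary.items(), key=lambda item: -len(item[0]))
    for label, entity_id in by_length:
        pattern = r"\b" + re.escape(label) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append((label, entity_id))
    return found


# Hypothetical vocabulary: one distinctive label and one that is also a common word.
vocabulary = {
    "Apache Stanbol": "urn:example:apache-stanbol",
    "Current": "urn:example:project-current",
}

text = "The current release of Apache Stanbol ships several engines."
matches = link_labels(text, vocabulary)
```

Here "Apache Stanbol" is matched as intended, but the project labelled 
"Current" is matched as well, although the text only uses "current" as an 
ordinary adjective. A thesaurus whose labels are mostly domain terms suffers 
much less from this.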

Based on the information you provided on your use case, I suggest that (2) 
should work just fine for you. This user scenario [2] should provide you with 
all the information needed to configure Stanbol for your use case.

I hope this helps. If you have any further questions, feel free to ask.

best
Rupert Westenthaler

[1] http://opennlp.apache.org/
[2] http://incubator.apache.org/stanbol/docs/trunk/customvocabulary.html

[3] http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/namedentityextractionengine.html
[4] http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/namedentitytaggingengine.html
[5] http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html

> cheers,
> Alfredo Serafini
