Hi Alfredo
On 22.03.2012, at 12:24, seralf wrote:
> Hi i'm new to stambol, i'm reading the documentation and examples, and i'd
> like to start some testing with it on italian language, if it's possible.
>
> Could someone give me some hint regarding the steps to try to costruct my
> model (Italian) and configure it inside the platform? I suppose it's
> possible and it should be not very far to the steps taken for construct
> -let's say- the Spanish integration.
> What i need to do? I know it could sound a very generic question, but it's
> not so clear from the documentation, so i need help.
> For my test i would like to be able to use a text corpora from the database
> of a client, and a skos thesaurus from the same domain.
>
> thanks in advance for every help (suggestions, code examples, ideas, etc)
>
In principle there are two different workflows for extracting entities from text:
(1) NamedEntityExtraction (NER) [3] => NamedEntityLinking [4]
(2) KeywordLinking [5]
(1) requires an OpenNLP [1] NER model for the language of your documents.
However, OpenNLP currently does not distribute any models for Italian, so you
would need to build your own. For more information on how to do that, please
see the OpenNLP documentation [1]. Once you have such models, you only need to
copy them into the {stanbol-workingdir}/sling/datafiles folder. If they follow
the naming scheme used by OpenNLP ("{lang}-ner-{type}.bin", e.g.
"it-ner-location.bin" for the model that detects locations in Italian text),
Stanbol will pick them up automatically.
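To make the copy step a bit more concrete, here is a minimal Python sketch that
checks file names against the "{lang}-ner-{type}.bin" scheme before copying
them. This is only an illustration and not part of Stanbol or OpenNLP: the
function names are made up, the two-letter language code and lowercase type
name are assumptions, and the target folder is whatever you pass in.

```python
import re
import shutil
from pathlib import Path

# Pattern for the "{lang}-ner-{type}.bin" naming scheme.
# Assumption: two-letter language code and a lowercase type name.
NER_MODEL_NAME = re.compile(r"^(?P<lang>[a-z]{2})-ner-(?P<type>[a-z]+)\.bin$")


def parse_model_name(filename):
    """Return (lang, type) if the name follows the scheme, else None."""
    match = NER_MODEL_NAME.match(filename)
    if match is None:
        return None
    return match.group("lang"), match.group("type")


def install_models(model_dir, datafiles_dir):
    """Copy correctly named *.bin files from model_dir to datafiles_dir.

    Returns two lists of file names: (copied, skipped).
    """
    source = Path(model_dir)
    target = Path(datafiles_dir)
    target.mkdir(parents=True, exist_ok=True)

    copied, skipped = [], []
    for path in sorted(source.glob("*.bin")):
        if parse_model_name(path.name) is not None:
            shutil.copy2(path, target / path.name)
            copied.append(path.name)
        else:
            # Name does not follow the scheme, so leave the file where it is.
            skipped.append(path.name)
    return copied, skipped
```

You would call install_models() with the folder holding your trained models and
the datafiles folder of your Stanbol working directory; anything listed as
skipped needs to be renamed first.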
(2) directly matches words of the text against the labels of entities within
the controlled vocabulary. This process can be improved by Natural Language
Processing (e.g. Part-of-Speech tagging), but this is not a requirement.
Typically this works fine for datasets that contain named entities, such as
concepts of a thesaurus; contacts of a company, projects, products … It does
not work well with datasets that contain entities whose labels are also used
as common words in the given language, as this will result in a lot of false
positives.
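To illustrate the idea behind (2), here is a toy Python sketch of label
matching. It is NOT how the KeywordLinking engine is implemented; it only shows
the principle of looking up (multi-word) labels in the text, longest match
first and ignoring case. The function name and the "ex:" identifiers are
invented for the example.

```python
import re

WORD = re.compile(r"\w+")


def link_keywords(text, labels):
    """Match vocabulary labels against the words of a text.

    labels maps a label (e.g. a SKOS prefLabel) to an entity id.
    Returns a list of (matched_text, entity_id, start_offset) tuples.
    """
    # Index labels by their lower-cased word sequence.
    index = {tuple(WORD.findall(label.lower())): entity
             for label, entity in labels.items()}
    max_len = max((len(key) for key in index), default=0)

    tokens = [(m.group().lower(), m.start(), m.end())
              for m in WORD.finditer(text)]
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest possible label first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(token[0] for token in tokens[i:i + n])
            if key in index:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                matches.append((text[start:end], index[key], start))
                i += n
                break
        else:
            i += 1
    return matches
```

With the labels "olio di oliva" and "olio", the text "l'olio di oliva" is
linked to the longer label only. The false-positive problem shows up as soon
as a label is also a common word: a contact labelled "Rosa" would also be
matched in "una rosa rossa".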
Based on the information you provided about your use case, I suggest that (2)
should work just fine for you. This user scenario [2] should provide you with
all the information needed to configure Stanbol for your use case.
I hope this helps. If you have any further questions, feel free to ask.
best
Rupert Westenthaler
[1] http://opennlp.apache.org/
[2] http://incubator.apache.org/stanbol/docs/trunk/customvocabulary.html
[3] http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/namedentityextractionengine.html
[4] http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/namedentitytaggingengine.html
[5] http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
> cheers,
> Alfredo Serafini