On 01/24/2013 11:59 AM, Renzo wrote:
Hi all,
I'm pretty new to OpenNLP.
My interest mainly lies in extracting document summaries using algorithms such as TextRank. This task requires sentence splitting and tokenization - here's where OpenNLP enters the game. I also need some degree of POS tagging to detect nouns, verbs and so on, in order to add some linguistic support to the ranking process.

It was fairly surprising to discover that noun tags - for example - are language dependent. Thus an "isNoun" predicate needs a specific answer for each language. It's "NN" for English, but it may be different for others.
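For instance, the predicate I have in mind looks roughly like this - only the English "NN" prefix (Penn Treebank style) is something I'm sure of; the German entry is just a guess on my side to illustrate the per-language lookup:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Language-aware "isNoun" predicate: each language maps to the tag prefixes
// that mark nouns in the tag set its POS model was trained with.
public class NounPredicate {

    private static final Map<String, List<String>> NOUN_TAG_PREFIXES =
            new HashMap<String, List<String>>();

    static {
        // English: Penn Treebank noun tags all start with NN (NN, NNS, NNP, NNPS)
        NOUN_TAG_PREFIXES.put("en", Arrays.asList("NN"));
        // German: guess based on the STTS tag set (NN = common noun, NE = proper noun)
        NOUN_TAG_PREFIXES.put("de", Arrays.asList("NN", "NE"));
    }

    public static boolean isNoun(String language, String posTag) {
        List<String> prefixes = NOUN_TAG_PREFIXES.get(language);
        if (prefixes == null) {
            return false; // unknown language, no mapping configured
        }
        for (String prefix : prefixes) {
            if (posTag.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}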

I just wonder if there is a common (i.e. language-independent) way to answer such questions.

Furthermore, is the logical format of the available binary files documented anywhere? Is there any way to browse those files to inspect the tag list they use?


No, we did not write up a specification of our model formats, though you can find lots of information about them in various places. All the models are zip files which contain simple artifacts, e.g. an XML dictionary, etc., plus maxent models. You can find an explanation of the maxent model format somewhere in the maxent project, but it is usually treated as a black box, because
the model can't really be modified after training.
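For instance, since a model is just a zip file, you can list what is inside with plain java.util.zip (or simply unzip it). Here is a quick sketch, assuming a downloaded English POS model named en-pos-maxent.bin - a tag dictionary, if the model was trained with one, shows up as its own XML entry:

import java.io.IOException;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Lists the artifacts packaged inside an OpenNLP model file.
public class ModelInspector {

    public static void main(String[] args) throws IOException {
        // Path to a downloaded model file (just an example name).
        ZipFile model = new ZipFile("en-pos-maxent.bin");
        try {
            Enumeration<? extends ZipEntry> entries = model.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                System.out.println(entry.getName() + "  (" + entry.getSize() + " bytes)");
            }
        } finally {
            model.close();
        }
    }
}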

Let us know if you have more questions about the formats; it's probably easier if we discuss it component by component,
depending on your needs.

Tokenization, sentence splitting and POS tagging are usually easy to get to perform nicely, especially when you do some training. The existing models are mostly trained on news articles and might not perform that well in other domains.
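For the basic pipeline, something along these lines should get you going - a sketch that assumes the pre-trained models en-sent.bin, en-token.bin and en-pos-maxent.bin are sitting in the working directory:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

// Minimal sentence splitting -> tokenization -> POS tagging pipeline
// built on the pre-trained English models.
public class Pipeline {

    public static void main(String[] args) throws IOException {
        InputStream sentIn = new FileInputStream("en-sent.bin");
        InputStream tokIn = new FileInputStream("en-token.bin");
        InputStream posIn = new FileInputStream("en-pos-maxent.bin");
        try {
            SentenceDetectorME sentenceDetector = new SentenceDetectorME(new SentenceModel(sentIn));
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokIn));
            POSTaggerME tagger = new POSTaggerME(new POSModel(posIn));

            String document = "OpenNLP supports the most common NLP tasks. It is written in Java.";

            for (String sentence : sentenceDetector.sentDetect(document)) {
                String[] tokens = tokenizer.tokenize(sentence);
                String[] tags = tagger.tag(tokens);
                for (int i = 0; i < tokens.length; i++) {
                    System.out.println(tokens[i] + "/" + tags[i]);
                }
            }
        } finally {
            sentIn.close();
            tokIn.close();
            posIn.close();
        }
    }
}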

Jörn
