There was once a project called MULTEXT:

* MULTEXT (Multilingual Text Tools and Corpora) (1994), by Nancy Ide and Jean Véronis, COLING'94
  http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.61.9202
* Project description (sorry, it is in French):
  http://aune.lpl.univ-aix.fr/projects/multext/LEX/LEX2.html

The idea was to define a single schema covering the specificities of all languages.
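To make the idea concrete, here is a minimal sketch in Python (it is not OpenNLP or MULTEXT code) of a language-independent "isNoun" predicate built on a per-language table that maps native tags to coarse categories. The names COARSE_TAGS, coarse_category and is_noun are hypothetical. The English entries assume Penn Treebank tags and the German ones assume STTS tags, so check them against the tagset your model was actually trained on.

```python
# Hypothetical mapping from language code to {native POS tag: coarse category}.
# Assumption: the English model emits Penn Treebank tags and the German
# model emits STTS tags; verify this against your own models.
COARSE_TAGS = {
    "en": {
        "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
        "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
        "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    },
    "de": {
        "NN": "NOUN", "NE": "NOUN",
        "VVFIN": "VERB", "VVINF": "VERB", "VVPP": "VERB",
    },
}


def coarse_category(lang, tag):
    """Return the coarse category for a native tag, or "OTHER" if unmapped."""
    return COARSE_TAGS.get(lang, {}).get(tag, "OTHER")


def is_noun(lang, tag):
    """Language-independent noun test on top of the per-language table."""
    return coarse_category(lang, tag) == "NOUN"
```

With such a table the ranking code only ever calls is_noun(lang, tag), and supporting a new language means adding one more entry to the table rather than touching the predicate.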
On Thu, Jan 24, 2013 at 12:41 PM, Jörn Kottmann <[email protected]> wrote:
> On 01/24/2013 11:59 AM, Renzo wrote:
>>
>> Hi all,
>> I'm pretty new to OpenNLP.
>> My interest is almost related to fetch document summaries using algorithms
>> such as TextRank.
>> This task requires sentence and token splitting - here's where OpenNLP
>> enters the game.
>> I also need some degree of POS to detect nouns, verbs and so on, in order
>> to add some linguistic support to the ranking process.
>>
>> It was fairly surprising to discover that noun tags - for example - are
>> language dependent. Thus an "isNoun" predicate needs a specific answer for
>> each language. It's "NN" for English, but it may be different for others.
>>
>> I just wonder if there is a common (e.g. language-independent) way to
>> answer such a kind of questions.
>>
>> Furthermore, is the logical format of available binary files documented
>> anywhere ? Is there any way to browse those files to inspect the used tag
>> list ?
>
> No, we did not write up a specification of our model formats. Though, you can
> find lots of information about it in various places.
> All the models are zip files, which contain simple artifacts, e.g. xml
> dictionary, etc and maxent models. You can find the
> format explanation about the maxent models somewhere in maxent project, but
> usually that is used like a black box, because
> the model can't really be modified after training.
>
> Let us know if you have more questions about the formats, it's probably
> easier when we discuss it component by component,
> depending on your needs.
>
> Tokenization, sentence splitting and the pos tagging are usually easy to get
> to perform nicely, especially when you do some training.
> The existing models are mostly trained on news articles and might not
> perform that well on other domains.
>
> Jörn

--
Dr. Nicolas Hernandez
Associate Professor (Maître de Conférences)
Université de Nantes - LINA CNRS UMR 6241
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
+33 (0)2 51 12 53 94
+33 (0)2 40 30 60 67
