On Wed, Jun 27, 2012 at 10:40 AM, Jörn Kottmann <[email protected]> wrote:
> On 06/21/2012 05:31 PM, Nicolas Hernandez wrote:
>>
>> On Thu, Jun 21, 2012 at 10:00 AM, Jörn Kottmann <[email protected]>
>> wrote:
>>>
>>> BTW, if you are training on the FrenchTreebank.
>>> We have dedicated format support for it in the trunk, would be
>>> easy to do the POS training with it.
>>
>> =)
>> I didn't know. It is nice for opennlp. I am sorry I did not answer
>> you about your invitation to integrate my own code. My approach was
>> not dedicated to the parsing of the FrenchTreebank. So I could not
>> integrate it easily.
>>
>> I've tried the converter. I am not sure how to use it?
>> [2] gives a sentence per line with no pos tag associated with the tokens.
>>
>> Anyway, it is very tricky to choose what to consider as tokens, either
>> compounds or only simple words, or what pos tag to give to the tokens.
>> I am not sure I understand well the choices which have been made in [1].
>> As soon as I manage to make the converter work, it will be
>> simpler to see them.
>
>
> The current implementation uses the same tag for all multi-word-unit
> tokens.
> For example: in_order_to/IN will be in/IN order/IN to/IN.
>
> Would you use the pos tags as they are in the data? Maybe it would be useful
> to add support for pos tag mappings. This would make it easy to experiment
> with different tag sets.

It is never easy to map tag sets, since the relation between them is
rarely bijective and, in addition, both processes have to assume the same
tokenization. I have run several experiments. Since I like to use
MaltParser [1], I now need to adapt the FTB to a tag set called ftb+, as
described in [2] (only multi-word expressions that can be recognized by a
regular expression are considered, some POS tags result from concatenating
the cat and subcat attributes, ...). I plan to do it by processing the
MarkupAnnotations provided by the Tika MarkupAnnotator [3].

[1] http://www.maltparser.org/mco/french_parser/fremalt.html
[2] P. Denis and B. Sagot. 2009. Coupling an annotated corpus and a
morphosyntactic lexicon for state-of-the-art POS tagging with less human
effort. In Proceedings of The Pacific Asia Conference on Language,
Information and Computation (PACLIC 23), Hong Kong, China.
[3] http://uima.apache.org/sandbox.html#tika.annotator

>
> Looks like we want to give the user some options on how these cases will be
> handled.
>
> We need to add direct support for training the POS Tagger on this data.
> Then you can do:
> bin/opennlp POSTaggerConverter frenchtreebank ...
>
> There is a problem with the Parse object and the way the command line tools
> are built. The cli tools assume that they can serialize via toString a sample
> object into training data, but that does not work for the Parse object yet.
> To fix that we need to make a breaking API change and need to refactor some
> code in the coref component.
> Anyway it would be nice to get the parser trained on it as well.
>
> Jörn
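
To make the two options discussed in this thread concrete (giving every part
of a multi-word unit the same tag, and applying a POS tag mapping), here is a
minimal sketch in Python. It is not the OpenNLP converter and does not use any
OpenNLP API. The function names, the "_" separator, the token/TAG input format
and the example mapping IN -> P are all assumptions made for illustration.

```python
def split_multiword(token, tag, sep="_"):
    """Split a multi-word-unit token; every part receives the same tag."""
    return [(part, tag) for part in token.split(sep) if part]


def map_tags(tagged, mapping):
    """Rewrite tags through a lookup table; unmapped tags are kept unchanged."""
    return [(tok, mapping.get(tag, tag)) for tok, tag in tagged]


def convert(sentence, mapping=None, sep="_"):
    """Convert a 'token/TAG token/TAG ...' line (hypothetical input format)."""
    out = []
    for item in sentence.split():
        token, slash, tag = item.rpartition("/")
        if not slash or not token:
            raise ValueError(f"expected token/TAG, got {item!r}")
        out.extend(split_multiword(token, tag, sep))
    if mapping:
        out = map_tags(out, mapping)
    return " ".join(f"{tok}/{tag}" for tok, tag in out)


# The in_order_to example from the quoted message:
print(convert("in_order_to/IN win/VB"))
# Same line with a hypothetical mapping applied afterwards:
print(convert("in_order_to/IN win/VB", mapping={"IN": "P"}))
```

Note that a plain lookup table can only express one-to-one or many-to-one
mappings, and it is applied after the tokenization has been fixed. A source
tag that has to be split into several target tags depending on context cannot
be handled this way, which is the non-bijective case mentioned above.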
