Another clarification on this, just in case it is useful: the chunking parser trains its models in the order build, tagger, chunker, check, and each setting in the training parameters file needs the name of the corresponding step as a prefix. Like this, for example:
Algorithm=MAXENT
build.Iterations=200
tagger.Iterations=200
chunker.Iterations=200
check.Iterations=200
build.Cutoff=4
tagger.Cutoff=4
chunker.Cutoff=4
check.Cutoff=4
build.Threads=4
tagger.Threads=4
chunker.Threads=4
check.Threads=4

Of course, if you insert a better POS model into the chunking-parse model you can just ignore the tagger parameters, etc.

Cheers,

Rodrigo

On Thu, May 2, 2013 at 4:18 PM, Rodrigo Agerri <[email protected]> wrote:
> Thanks Jörn, that worked.
>
> Just in case anyone is wondering about the 4 steps Jörn mentioned, I
> looked at the chunking/Parser.java code again and found the reference
> to the author of the parsing approach used by the chunker parser
> (based on MaxEnt), whose thesis can be found here:
>
> http://www.ircs.upenn.edu/download/techreports/1998/98-15.pdf
>
> As the first two steps (tag and chunk, in this order) are already
> provided by the training data you can configure the other two (build
> and check, in this order) in the lang/TrainerParams.txt as you
> suggested:
>
> build.Cutoff=2
> build.Iterations=200
> build.Threads=4
>
> check.Cutoff=2
> check.Iterations=200
> check.Threads=4
>
> Cheers,
>
> Rodrigo
>
> On Tue, Apr 30, 2013 at 9:46 PM, Joern Kottmann <[email protected]> wrote:
>> Short answer from my phone, instead of Cutoff the parameter name is
>> check.Cutoff=0 for example. I will have a closer look tomorrow and reply on
>> the list, would be nice to have a sample parameter file for the parser be
>> checked in.
>>
>> Cheers Jörn
>>
>> On Apr 30, 2013 7:50 PM, "Rodrigo Agerri" <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> Thanks for your answers, I will explain myself better.
>>>
>>> I edit the lang/TrainerParams.txt file where I specify, for example:
>>>
>>> Algorithm=MAXENT
>>> Iterations=1000
>>> Cutoff=0
>>> Threads=4
>>>
>>> Then I run the ParserTrainer from the CLI:
>>>
>>> bin/opennlp ParserTrainer -headRules
>>> /home/ragerri/experiments/parsing/opennlp/es/data/es-head-rules
>>> -parserType CHUNKING -params lang/TrainerParams.txt -lang es -model
>>> test.bin -encoding UTF-8 -data
>>> /home/ragerri/experiments/parsing/ancora-2.0/ancora2.treebank
>>>
>>> It trains fine, and the model works fine in a system using Apache
>>> OpenNLP API, but it still uses the cutoff 5 and 100 iterations that
>>> seems to be the default specification training parameters for
>>> ParserTrainer.
>>>
>>> I can change these parameters for parser training using the API, that
>>> works fine, but I cannot manage to do it from the command line.
>>>
>>> I did not understand your suggestion, Jörn, could you please provide
>>> an example?
>>>
>>> Thanks,
>>>
>>> Rodrigo
>>>
>>>
>>>
>>> On Tue, Apr 30, 2013 at 4:21 PM, Jörn Kottmann <[email protected]> wrote:
>>> > On 04/30/2013 04:03 PM, William Colen wrote:
>>> >>
>>> >> Are you using the command line tool? If yes, you should pass the path to
>>> >> the parameters file in the command line argument -params <file-path>
>>> >>
>>> >
>>> > The parser trains multiple models, to make the parameters work they are prefixed,
>>> > the prefixes for the four models are: tagger, chunker, check and build.
>>> > Just put them in front of the usual parameter names.
>>> >
>>> > HTH,
>>> > Jörn
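
P.S. For anyone who would rather generate the parameters file than type every prefixed line by hand, here is a minimal sketch in Python 3. The helper name prefixed_params and its arguments are my own invention and are not part of OpenNLP; only the four prefixes (build, tagger, chunker, check) and the Algorithm, Iterations, Cutoff and Threads keys come from this thread. It just repeats each shared setting once per step, with optional per-step overrides.

```python
# Hypothetical helper, not part of OpenNLP: builds the text of a prefixed
# parameters file for the chunking parser.

# Training order of the four models, as described in this thread.
PARSER_STEPS = ("build", "tagger", "chunker", "check")


def prefixed_params(shared, algorithm="MAXENT", overrides=None):
    """Return the text of a parameters file with one prefixed entry per step.

    shared    -- settings applied to every step, e.g. {"Iterations": 200}
    overrides -- optional per-step values for keys that appear in `shared`,
                 e.g. {"check": {"Cutoff": 0}}
    """
    overrides = overrides or {}
    lines = [f"Algorithm={algorithm}"]
    for key, default in shared.items():
        for step in PARSER_STEPS:
            # Use the step-specific value if one was given, else the shared one.
            value = overrides.get(step, {}).get(key, default)
            lines.append(f"{step}.{key}={value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Prints the same settings as the example file at the top of this message.
    print(prefixed_params({"Iterations": 200, "Cutoff": 4, "Threads": 4}), end="")
```

Redirect the output to lang/TrainerParams.txt (or whatever file you pass to -params) and train as before.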
