On 09/17/2013 08:55 AM, Giorgio Valoti wrote:
Hi all,
this is my first post to the list. I’ve tried to gather some info from the
documentation and by googling around, but I haven’t found a satisfying answer
to the following questions. Please tell me where to RTFM if any of these
questions belong in a FAQ or are off-topic.

It seems there’s no way to incrementally train the POS tagger nor to 
parallelize this task. Is this correct?

If the only way to train the POS tagger is in one single shot, how can I
estimate the memory requirements for the JVM? In other words, given, say, a 1GB
training corpus, is there a way to estimate how much RAM would be needed?

Finally, I have tried to use the `-ngram` switch:
opennlp POSTaggerTrainer.conllx -type maxent -ngram 3 ... <other options as usual: -lang -model -data -encoding>
but I get this error:
Building ngram dictionary ... IO error while building NGram Dictionary: Stream not marked
Stream not marked
java.io.IOException: Stream not marked
         at java.io.BufferedReader.reset(BufferedReader.java:485)
         at opennlp.tools.util.PlainTextByLineStream.reset(PlainTextByLineStream.java:79)
         at opennlp.tools.util.FilterObjectStream.reset(FilterObjectStream.java:43)
         at opennlp.tools.util.FilterObjectStream.reset(FilterObjectStream.java:43)
         at opennlp.tools.cmdline.postag.POSTaggerTrainerTool.run(POSTaggerTrainerTool.java:80)
         at opennlp.tools.cmdline.CLI.main(CLI.java:222)

But I can’t figure out what I’m doing wrong.


Looks like it tries to reset the stream, but that doesn't seem to work. Are you using 1.5.3?
Please open a Jira issue for this so we can fix it.
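For reference, the "Stream not marked" message comes straight from the JDK: BufferedReader.reset() only works after mark() has been called. A minimal sketch that reproduces the same exception:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.StringReader;

    public class ResetDemo {
        public static void main(String[] args) throws IOException {
            BufferedReader in = new BufferedReader(new StringReader("a\nb\n"));
            in.readLine();
            // Without a prior in.mark(readAheadLimit) call this throws
            // java.io.IOException: Stream not marked
            in.reset();
        }
    }

That fits your trace: the tool builds the ngram dictionary in a first pass over the data and then resets the stream for training, but the underlying reader was apparently never marked.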

Usually the POS tagger is trained without the ngram option; it is an old left-over experiment in the code which didn't turn out to improve things.
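In other words, your command minus the -ngram switch should do:

    opennlp POSTaggerTrainer.conllx -type maxent -lang <lang> -model <model> -data <data> -encoding <encoding>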

If you have that much data you probably want to use the two-pass data indexer (TwoPassDataIndexer); I am not sure if the POS tagger uses it by default. A higher cutoff might help to reduce the required memory, and otherwise just try to give the process a couple of gigabytes of RAM.
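If you need more control than the command line tool gives you, here is a rough sketch of training through the API instead, assuming 1.5.x; the exact train(...) signature and the "DataIndexer"/"Cutoff" parameter names are from memory, so please double-check them against your version:

    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSSample;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.postag.WordTagSampleStream;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainTagger {
        public static void main(String[] args) throws Exception {
            // One word_tag sentence per line, as for the command line tool
            ObjectStream<String> lines = new PlainTextByLineStream(
                new InputStreamReader(new FileInputStream("train.data"), "UTF-8"));
            ObjectStream<POSSample> samples = new WordTagSampleStream(lines);

            TrainingParameters params = new TrainingParameters();
            params.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
            // A higher cutoff drops rare features and so reduces memory use
            params.put(TrainingParameters.CUTOFF_PARAM, "10");
            // Ask for two-pass data indexing instead of the in-memory one
            params.put("DataIndexer", "TwoPass");

            POSModel model = POSTaggerME.train("en", samples, params, null, null);
            samples.close();
        }
    }

The heap size itself is always a JVM flag, so whichever route you take, something like java -Xmx4g ... is how you give the process those couple of gigabytes; I don't think there is a precise formula to derive it from the corpus size alone.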

On which data set are you training? A gigabyte of training data is quite a lot ...

HTH,
Jörn
