On 09/17/2013 08:55 AM, Giorgio Valoti wrote:
Hi all,
this is my first post to the list. I’ve tried to gather some info from the
documentation and by googling around, but I haven’t found a satisfying answer
to the following questions. Please tell me where to RTFM if any of these
questions belong to a FAQ or are off-topic.
It seems there’s no way to incrementally train the POS tagger or to
parallelize this task. Is this correct?
If the only way to train the POS tagger is in one single shot, how can I
estimate the memory requirements for the JVM? In other words, given, say, a
1 GB training corpus, is there a way to estimate how much RAM would be needed?
Finally, I have tried to use the `-ngram` switch:
    opennlp POSTaggerTrainer.conllx -type maxent -ngram 3 ... <other options as usual: -lang -model -data -encoding>
but I get this error:
Building ngram dictionary ... IO error while building NGram Dictionary: Stream not marked
Stream not marked
java.io.IOException: Stream not marked
    at java.io.BufferedReader.reset(BufferedReader.java:485)
    at opennlp.tools.util.PlainTextByLineStream.reset(PlainTextByLineStream.java:79)
    at opennlp.tools.util.FilterObjectStream.reset(FilterObjectStream.java:43)
    at opennlp.tools.util.FilterObjectStream.reset(FilterObjectStream.java:43)
    at opennlp.tools.cmdline.postag.POSTaggerTrainerTool.run(POSTaggerTrainerTool.java:80)
    at opennlp.tools.cmdline.CLI.main(CLI.java:222)
But I can’t figure out what I’m doing wrong.
Looks like the trainer tries to reset the stream after building the ngram
dictionary, but the reset fails because the underlying stream was never marked.
Are you using 1.5.3?
Please open a Jira issue for this so we can fix it.
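For what it's worth, the "Stream not marked" message is plain java.io
behavior rather than anything OpenNLP-specific: BufferedReader.reset() can
only rewind to a position previously recorded with mark(), and it throws
exactly this IOException when no mark was ever set. A minimal, self-contained
sketch:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.StringReader;

    public class MarkResetDemo {
        public static void main(String[] args) throws IOException {
            BufferedReader in = new BufferedReader(new StringReader("some training data"));

            try {
                in.reset(); // no mark() was ever called ...
            } catch (IOException e) {
                System.out.println(e.getMessage()); // ... so this prints: Stream not marked
            }

            in.mark(1024); // remember the current position (valid for the next 1024 chars)
            in.read();
            in.reset();    // fine now: rewinds to the marked position
        }
    }

So somewhere in the trainer's code path the reader gets reset for a second
pass over the data without a mark having been set first; that detail is worth
including in the Jira issue.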
Usually the POS tagger is trained without this ngram option; it is an old
left-over experiment in the code which didn't turn out to improve things.
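That is, your invocation should work with the -ngram switch simply dropped
(untested on my side, but it skips the ngram dictionary pass that triggers
the failing reset):

    opennlp POSTaggerTrainer.conllx -type maxent -lang <lang> -model <model> -data <data> -encoding <encoding>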
If you have that much data you probably want to use the TwoPassDataIndexer
(I am not sure if the POS tagger uses it by default). A higher cutoff might
help to reduce the required memory, and otherwise just try to give the
process a couple of gigabytes of RAM.
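If you train via the API instead of the CLI, something along these lines
should let you raise the cutoff and request the two-pass indexer. This is
only a sketch: the file name and language are placeholders, and the
"DataIndexer"/"TwoPass" parameter names are how the 1.5.x maxent code spells
them, so double-check them against your version:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSSample;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.postag.WordTagSampleStream;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainPosModel {
        public static void main(String[] args) throws IOException {
            // Word_Tag formatted training data, one sentence per line.
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new InputStreamReader(new FileInputStream("train.data"), "UTF-8"));
            ObjectStream<POSSample> samples = new WordTagSampleStream(lines);

            TrainingParameters params = TrainingParameters.defaultParams();
            params.put(TrainingParameters.CUTOFF_PARAM, "10"); // drop rare events to save memory
            params.put("DataIndexer", "TwoPass");              // index the events in two passes

            // Passing null for the tag dictionary and the ngram dictionary.
            POSModel model = POSTaggerME.train("en", samples, params, null, null);
            samples.close();
            // serialize the model with model.serialize(...) as usual
        }
    }

I believe the CLI trainer accepts the same keys through a -params file
(Cutoff=10, DataIndexer=TwoPass), but check the usage output of your version
to be sure.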
On which data set are you training? A gigabyte of training data is quite
a lot ...
HTH,
Jörn